IN THE UNITED STATES PATENT AND TRADEMARK OFFICE

In re application of David F. Bush et al. Art Unit: 1634 


Serial No: 09/803,736 Examiner: Einsmann, Juliet Caroline 




Filed: 03/12/2001 



For: Plant Polymorphic Markers and Uses Thereof



RESPONSE TO RESTRICTION REQUIREMENT AND 
REQUIREMENT FOR INFORMATION UNDER 37 CFR 1.105 



Assistant Commissioner for Patents 
Washington, DC 20231 



Sir: 



Responsive to the further restriction requirement set forth in the Patent Office 
communication dated October 8, 2002 and setting forth a one month period for response, 
Applicants provide the following. 

Applicants traverse the further undue division of their invention, which effectively
requires Applicants to file at least 50 applications to cover the full scope of their invention of
claim 18, which is characterized by 100 identified polymorphic markers listed in a Markush
group. Applicants submit that such a restriction is an improper attempt by the PTO to rewrite
claim 18 to cover another invention (which is not what Applicants wish to claim) and thus is
effectively an improper rejection of claim 18 under 35 U.S.C. 121. Imposing this type of
effective rejection on a Markush claim is improper as a matter of law. In re Weber, 580 F.2d
455, 459, 198 U.S.P.Q. 328, 332 (C.C.P.A. 1978) (holding that "a rejection [of a Markush claim]
under §121 violates the basic right of the applicant to claim his invention as he chooses").
Applicants reserve the right to appeal this rejection. Applicants find the rationale that a burden is
created on USPTO computer resources by running a larger search to be an unpersuasive attempt to
justify the improper examination. Technical incompetence in the leading patent examining
institution will not be tolerated. The further recitation of a burden on an examiner having to "wade
through reams and reams of results in order to evaluate each polymorphism with regard to prior
art" is likewise dismissed as a pitiful excuse for denying examination of Applicants' invention.
Wading through reams would be an archaic method when the PTO could easily construct a
search algorithm to display results in order of novelty, e.g. ranked by percent identity, which
would allow examination of the few sequences at the novelty interface.
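
By way of illustration only, the ranking described above could be sketched as follows. This is a minimal sketch under stated assumptions and does not describe any actual PTO search system: the `Hit` record, the `rank_hits` function, the `min_length` cutoff and the sample hits are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    subject_id: str          # hypothetical identifier of a prior-art sequence
    percent_identity: float  # identity of the aligned region, in percent
    alignment_length: int    # aligned length in bases


def rank_hits(hits, min_length=20):
    """Drop short alignments, then sort by percent identity, highest first.

    Ties are broken by alignment length, longest first.
    """
    kept = [h for h in hits if h.alignment_length >= min_length]
    return sorted(kept, key=lambda h: (-h.percent_identity, -h.alignment_length))


# Invented sample results for one query sequence.
hits = [
    Hit("refA", 92.5, 41),
    Hit("refB", 100.0, 41),
    Hit("refC", 100.0, 12),   # too short to be informative under the assumed cutoff
    Hit("refD", 97.6, 41),
]
ranked = rank_hits(hits)
for h in ranked:
    print(h.subject_id, h.percent_identity)
```

With results ordered this way, only the hits at the top of the list would need to be reviewed against each elected sequence.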

However, to comply with the restriction requirement to select a set of two polymorphisms 
to which examination of claims 18 and 19 will be limited and advance the prosecution of this 
application, Applicants hereby provisionally elect SEQ ID NOs: 466799 and 471736.

Responsive to the Requirement for Information under 37 CFR 1.105 in the above
referenced communication from the Patent Office, Applicants provide the following: 

Re paragraph 2 of the Requirement for Information. 

Copies of two publications co-authored by one or more inventors named on the present 
application are enclosed. 

Analysis of the genome sequence of the flowering plant Arabidopsis
thaliana, Nature 408:796-815 (December 14, 2000). S. Rounsley et al. are authors
of a section of this paper entitled "Comparative analysis of the genomes of A.
thaliana accessions" which appears at page 801 of the publication. No portion of
the marker dataset was disclosed in conjunction with this publication, although a
reference to the TAIR website was provided.

Jander et al. (2002) Arabidopsis Map-Based Cloning in the Post-Genome
Era, Plant Physiology 129:440-450. A small portion of the marker dataset was
disclosed in this publication.

Re paragraph 3 of the Requirement for Information.

A printout of pages available at the TAIR website is attached. The pages provide 
information on the date of initial availability of the Cereon Arabidopsis marker dataset and the 
contents and dates of availability of updates to the dataset. Applicants note that registration is 
not required for access to this summary information available at the website 
www.arabidopsis.org/cereon/help.html. Furthermore, if the Examiner wishes to personally
access the marker information at the above website, the undersigned representative of
Applicants would be happy to provide a user name and password to allow the Examiner to
access the information. In response to parts a. and b. of this request, Applicants provide the
following summary of the data releases and their relation to the present application and its
priority documents. Importantly, Applicants note that any information in the present application
that was released on the TAIR website was released after the information was filed in either the
present application or one of its priority applications.

Release 1: Released May 3, 2000

25,274 SNPs
14,041 InDels

To the best of Applicants' knowledge, the 25,274 SNPs in Release 1 correspond to the
25,274 SNPs presented in Applicants' priority application filed March 29, 2000 (serial number 
09/534,859). 




38-10(15493)D 

The 14,041 InDels in Release 1 are all, to the best of Applicants' knowledge, present in 
Applicants' priority application filed March 29, 2000 (serial number 09/534,859). The priority 
application also contained additional InDels larger than 100 bp that were excluded from the 
Release 1 dataset on the initial Release date as indicated on the enclosed information from the 
TAIR website. 

Release 2: Released November 16, 2000 

2843 SNPs 
1633 InDels 

To the best of Applicants' knowledge, the 2843 SNPs in Release 2 correspond to the 
2843 SNPs presented in Applicants' priority application filed October 20, 2000 (serial number 
09/692,412). 

The 1633 InDels in Release 2 are all, to the best of Applicants' knowledge, present in
Applicants' priority application filed October 20, 2000 (serial number 09/692,412). This
application also contained additional InDels larger than 100 bp that were excluded from the
Release 2 dataset at its initial release date as indicated on the enclosed information from the
TAIR website.

Release 3: Released March 21, 2001

9226 SNPs 
2905 InDels 
42 Large InDels 

Applicants note that although the release notes indicate the release of 9227 SNPs and 43 
large InDels, the actual numbers were 9226 and 42, as determined by downloading the marker 
set from the TAIR site. 

To the best of Applicants' knowledge, the 9226 SNPs in Release 3 are all present in the
set of 37,343 SNPs in the present application filed on March 12, 2001. Applicants note that the
marker set in the present application is cumulative and contains the SNPs from all three TAIR
releases.
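
The cumulative figure can be checked against the per-release SNP counts given above. The short sketch below is offered only as an arithmetic check; the variable names are illustrative and the counts are those stated in this response (Release 3 uses the downloaded count of 9226).

```python
# SNP counts per TAIR release, as stated in this response.
release_snps = {
    "Release 1": 25_274,
    "Release 2": 2_843,
    "Release 3": 9_226,  # count determined by downloading the marker set
}

cumulative_snps = sum(release_snps.values())
print(cumulative_snps)  # 37343, the number of SNPs in the present application
```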

The 2905 InDels and 42 large InDels in Release 3 are all, to the best of Applicants'
knowledge, present in the present application filed on March 12, 2001.

In response to part c. of this request, Applicants note that information on the data format 
requested by the Examiner is provided at page 2 of the TAIR website printout provided herein. 
Applicants note that polymorphism locations are provided by chromosome and with reference to 
their location in the public BACs referenced in the tables. Oligonucleotides for amplification are 
not provided, but 20 bp of sequence directly to the left and right of the polymorphism is provided. 

Re paragraph 4 of the Requirement for Information. 

The Examiner has also requested information on the location of the elected markers in 
Table A of the application. Applicants provide herewith the printed pages of Table A that 
contain the information on markers 466799 and 471736. In Applicants' records, the information
for marker 466799 is found at page 159/208 of Table A, and the information for marker 471736
is found at page 75/208 of Table A. Applicants note that a paper copy of the table as submitted 
with the application was not retained, and it is not known if the above listed page numbers 
correspond with the page numbers of the record copy of the application. For the Examiner's 
convenience, in addition to the printed pages, we have extracted the information on the elected 
markers from Table A and provide this information on a separate sheet in table format, including 
the table headings. 

CHANGE OF CORRESPONDENCE ADDRESS

Please direct all future correspondence to: 

Gail Wuelner, E2NA 
Monsanto Company 
800 N. Lindbergh Blvd. 
St. Louis, MO 63167

Please direct future telephone calls to:
Thomas E. Kelley 
Registration No. 29,938 
(860) 572-5274 



Respectfully submitted, 



Thomas E. Kelley 
Registration No. 29,938 
Applicant's Attorney 
Phone: (860) 572-5274 






11/08/02 FRI 17:52 FAX 18605725240 DEKALB 121009 




CEREON ARABIDOPSIS POLYMORPHISM COLLECTION HELP 


The accompanying tables contain potential nucleotide polymorphisms between Columbia and
Landsberg, based on data generated by Cereon Genomics using all the sequenced
BACs up to the point of each release. The clones listed in the table are ordered by chromosome and
their location on a chromosome.

The data were kindly submitted by Dr. Steve Rounsley. All questions referring to the data should be 
sent to: athal@cereon.com 

Release 1: Released May 3, 2000

SNPs and INDELs

Entries (from 981 BACs): 25,274 SNPs (Single Nucleotide Polymorphism) and
14,041 INDELs (small insertions/deletions).

Large INDELs

632 INDELs > 100 bp found in 341 of the 980 BACs used for Release 1
Size range: Min 101 bp, Max 38 kb

This collection of INDELs are larger than 100 bp and were omitted from the original polymorphism
release due to an increased level of false positives contained within. Repetitive sequence is the
major cause for such false positives, and the larger the gap between two matches, the more likely
one of the matches is due to a match against a repeated region.

The data is being provided with these caveats, so please use this with caution. There are however
true positives in this dataset, particularly the INDELs at the lower end of the size distribution, and
many people had requested access to these for the purposes of specific studies such as transposon
analysis.



Release 2: Released on November 16, 2000 

SNPs and INDELs

Release 2 contains 4476 predicted polymorphisms from 124 BAC clones that have been sequenced
by the AGI. 1,633 INDELs and 2843 SNPs were found. This is an 11% increase over the previous
collection of approximately 39,000.
The remainder of the Columbia BACs are currently being processed and there will be a final update
of the Cereon collection shortly after this is completed.

Large INDELs

72 Large INDELs found in 43 of the 124 BACs used for Release 2.

Also provided here are INDELs greater than 100 bp. These should be treated with caution as they
are more likely to be the result of artifacts of the analysis method. However, many will still be true
insertions/deletions and may therefore be of interest for certain kinds of analysis.

Release 3: Released on March 21, 2001 

SNPs and INDELs

Release 3 contains 12175 predicted polymorphisms from 396 BAC clones that have been sequenced
by the AGI. 2905 INDELs, 9227 SNPs, and 43 Large INDELs were found. This is a 27% increase
over the previous collection of approximately 45,000. The total number of polymorphisms now is
56,670.
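
The figures quoted in the release notes above are mutually consistent, which can be verified with a few lines of arithmetic. The sketch below is illustrative only. It assumes, because this reading reproduces the stated numbers, that the Release 2 figure of 4476 counts SNPs plus small INDELs, that the Release 3 figure of 12175 also includes the large INDELs, and that the grand total counts every category in every release. The variable names are hypothetical.

```python
# Counts as printed in the release notes (Release 3 uses the notes' own 9227 and 43).
releases = {
    "Release 1": {"snps": 25_274, "indels": 14_041, "large_indels": 632},
    "Release 2": {"snps": 2_843, "indels": 1_633, "large_indels": 72},
    "Release 3": {"snps": 9_227, "indels": 2_905, "large_indels": 43},
}

# Assumed reading: Release 2's headline number excludes large INDELs.
release2_predicted = releases["Release 2"]["snps"] + releases["Release 2"]["indels"]
# Assumed reading: Release 3's headline number includes large INDELs.
release3_predicted = sum(releases["Release 3"].values())
# Every category in every release.
grand_total = sum(sum(counts.values()) for counts in releases.values())

# Percent increases relative to the approximate collection sizes quoted in the notes.
pct_release2 = round(100 * release2_predicted / 39_000)
pct_release3 = round(100 * release3_predicted / 45_000)

print(release2_predicted, release3_predicted, grand_total, pct_release2, pct_release3)
```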



http://www.arabidopsis.org/cereon/help.html

Column Headings:

1. SNP_Name: CER (Cereon) + Cereon's internal ID number, e.g. CER454879

2. Chromosome: 1 thru 5

3. BAC_Name: in standard AGI format, ordered by position in chromosome

4. BAC_Accession: GenBank/EMBL/DDBJ accession number

5. BAC_Length: in bp

6. Left_Coor*: The coordinate of the base to the left of the polymorphic location. See below.

7. Right_Coor*: The coordinate of the base to the right of the polymorphic location. See below.

8. SNP_Type: SNP (Single Nucleotide Polymorphism) or IND (Insertion/Deletion)

9. IND_Size: size of insertion or deletion in Columbia, Col/Ler, e.g. -4/4 means a 4 bp deletion in
Columbia and 4/-4 means a 4 bp insertion in Col. Left blank for SNPs.

10. SNP_Base: changed nucleotide, Col/Ler, e.g. A/T means A in Columbia and T in Landsberg. Left
blank for INDELs.

11. Left_Flank: 20 bp directly to the left of the polymorphic location

12. Right_Flank: 20 bp directly to the right of the polymorphic location. See Note below.

13. Restriction_sites (Col): restriction sites in Col from the SNP/IND

14. Restriction_sites (Ler): restriction sites in Ler from the SNP/IND

*Coordinates:

For a SNP - the two coordinates flank the polymorphic base.

For an insertion in Columbia, the two coordinates flank the inserted sequence.

For a deletion in Columbia, the deletion occurs between the two coordinates listed.
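
The Col/Ler convention used by the IND_Size and SNP_Base columns can be illustrated with a short parsing sketch. It is a sketch under stated assumptions, not part of the Cereon data files: the function names are hypothetical, and the SNP_Base value is assumed to be slash-separated in the same way as IND_Size (for example A/T).

```python
def parse_ind_size(field):
    """Parse an IND_Size value such as '-4/4' into (col, ler) integers.

    Returns None when the field is blank, as it is for SNPs.
    """
    field = field.strip()
    if not field:
        return None
    col, ler = field.split("/")
    return int(col), int(ler)


def describe_indel(field):
    """Describe an IND_Size value from the point of view of Columbia."""
    parsed = parse_ind_size(field)
    if parsed is None:
        return "not an indel"
    col, _ler = parsed
    kind = "deletion" if col < 0 else "insertion"
    return f"{abs(col)} bp {kind} in Columbia"


def parse_snp_base(field):
    """Parse a SNP_Base value (assumed format 'A/T') into Col and Ler bases.

    Returns None when the field is blank, as it is for INDELs.
    """
    field = field.strip()
    if not field:
        return None
    col, ler = field.split("/")
    return {"Col": col, "Ler": ler}


print(describe_indel("-4/4"))
print(describe_indel("4/-4"))
print(parse_snp_base("A/T"))
```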



1. 20mers are provided for locating the correct coordinates just in case the BAC sequence changes.
Also note that the 20mers were supposed to be directly flanking the changed nucleotide, but the right
flanking sequence may be off by one base for SNPs - but not for INDs.

Please note that the coordinates used in the datafiles refer to the originally submitted BAC sequence.
Many BAC sequences at GenBank have been edited by the AGI groups in order to produce finished
chromosome records. This involves removing overlapping regions, and flipping some clones in order
to produce a consistent direction along the chromosome. In addition, AGI groups may make
alterations at any time to the submitted sequence in order to correct errors. This can also cause the
original coordinates to be inaccurate.

In order to access the original BAC sequence, you need to use the link provided in the current
GenBank record. The link will look something like this:

"COMMENT: On Dec 16, 1999 this sequence version replaced gi:5729683" 



The flanking sequence provided in the Cereon data files attempts to provide an alternative way to
locate the polymorphism. The 20mers can be used to BLAST against the Arabidopsis genome to
identify the specified location. There are some caveats to keep in mind when doing this:

• 1. This sequence should help find the appropriate location in the BAC of interest. It is not
necessarily unique to the genome. It may also match other BACs in the genome, but these are
not important for locating the polymorphism.

• 2. If the 20mer matches more than once in the BAC of interest, try using the other 20mer as
well and combining the results. You can also use TAIR's EMMatch, which allows you to put in
the polymorphic sequence as well as its approximate length in between the two 20mer set.

• 3. If the 20mer does not find a match in the BAC of interest, it could be that the editing
mentioned above may have moved this location to a neighboring BAC. In this case, check
your search results against the neighboring BACs.

• 4. If it still does not match, beware that using the default BLAST parameters does not always
work well with such a small query sequence. Several things can increase your chances of
finding a match in the BAC sequence of interest:

o A. Use a smaller database. An example would be a species specific collection at NCBI,
or the TAIR BLAST server selecting only Arabidopsis genomic sequences > 10kb



o B. Do not filter for low complexity.

o C. Increase the mismatch penalty to -8. This should force identical matches.

• 5. If multiple hits to the same BAC occur - do not panic. Remember, many INDELs are caused
by a different copy number of a direct repeat. The flanking sequence may therefore hit multiple
places. The best bet here is to pick primers several hundred bases either side of this general
region.
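
The idea of combining both 20mers to pin down a location can be sketched in a few lines. This is an illustration only: it uses exact string matching in place of BLAST, the function names and the `max_gap` parameter are hypothetical, and the toy sequence is invented rather than taken from any BAC.

```python
def find_all(seq, sub):
    """Return every 0-based start index of sub in seq, overlapping matches included."""
    starts = []
    i = seq.find(sub)
    while i != -1:
        starts.append(i)
        i = seq.find(sub, i + 1)
    return starts


def locate(bac_seq, left_flank, right_flank, max_gap=100):
    """Pair left- and right-flank matches that lie close together.

    Returns (left_end, right_start) pairs, where left_end is the index just
    past the left flank and right_start is where the right flank begins.
    A pair is kept only if the right flank starts 0..max_gap bases after
    the left flank ends, which filters out matches in unrelated repeats.
    """
    bac_seq = bac_seq.upper()
    left_flank = left_flank.upper()
    right_flank = right_flank.upper()
    pairs = []
    for left_start in find_all(bac_seq, left_flank):
        left_end = left_start + len(left_flank)
        for right_start in find_all(bac_seq, right_flank):
            if 0 <= right_start - left_end <= max_gap:
                pairs.append((left_end, right_start))
    return pairs


# Invented toy example: one polymorphic base "T" between two 20mers.
left = "ACGTACGTACGTACGTACGA"
right = "GGCCTTAAGGCCTTAAGGCA"
bac = "TTTT" + left + "T" + right + "CCCC"
print(locate(bac, left, right))
```

If either flank matches several times, only the pairs that fall within `max_gap` of each other survive, which mirrors the advice in caveat 2 to combine the results from both 20mers.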



Analysis of the genome sequence of the
flowering plant Arabidopsis thaliana

The Arabidopsis Genome Initiative*

* Authorship of this paper should be cited as 'The Arabidopsis Genome Initiative'. A full list of contributors appears at the end of this paper.



Arabidopsis thaliana is an important model system for identifying genes and determining their functions.
Here we report the analysis of the genomic sequence of Arabidopsis. The sequenced regions cover 115.4 megabases of the
[...] subsequent gene loss and extensive local gene duplications, giving rise to a dynamic genome enriched by [...].
The genome contains 25,498 genes [...] Drosophila and Caenorhabditis elegans, the other sequenced multicellular
eukaryotes. Arabidopsis has many families of new proteins but also lacks several common protein families, indicating that the sets
of common proteins have undergone differential expansion and contraction in the three multicellular eukaryotes.
[...] identifying a wide range of plant-specific gene functions and establishing rapid systematic ways to identify
genes for crop improvement.



The plant and animal kingdoms evolved independently from 
unicellular eukaryotes and represent highly contrasting life forms. 
The genome sequences of C. elegans 1 and Drosophila 2 reveal that
metazoans share a great deal of genetic information required for
developmental and physiological processes, but these genome
sequences represent a limited survey of multicellular organisms.
Flowering plants have unique organizational and physiological
properties in addition to ancestral features conserved between
plants and animals. The genome sequence of a plant provides a
means for understanding the genetic basis of differences between
plants and other eukaryotes, and provides the foundation for
detailed functional characterization of plant genes.

Arabidopsis thaliana has many advantages for genome analysis,
including a short generation time, small size, large number of
offspring, and a relatively small nuclear genome. These advantages
promoted the growth of a scientific community that has investigated
the biological processes of Arabidopsis and has characterised
many genes 3. To support these activities, an international collaboration
(the Arabidopsis Genome Initiative, AGI) began sequencing
the genome in 1996. The sequences of chromosomes 2 and 4 have
been reported 4,5, and the accompanying Letters describe the
sequences of chromosomes 1 (ref. 6), 3 (ref. 7) and 5 (ref. 8).

Here we report analysis of the completed Arabidopsis genome
sequence, including annotation of predicted genes and assignment
of functional categories. We also describe chromosome dynamics
and architecture, the distribution of transposable elements and
other repeats, the extent of lateral gene transfer from organelles,
and the comparison of the genome sequence and structure to that of
other Arabidopsis accessions (distinctive lines maintained by single-seed
descent) and plant species. This report is the summation of
work by experts interested in many biological processes selected to
illuminate plant-specific functions including defence, photomorphogenesis,
gene regulation, development, metabolism, transport
and DNA repair.

The identification of many new members of receptor families, 
cellular components for plant-specific functions, genes of bac- 
terial origin whose functions are now integrated with typical 
eukaryotic components, independent evolution of several families 
of transcription factors, and suggestions of as yet uncharacterized 
metabolic pathways are a few more highlights of this work. The 
implications of these discoveries are not only relevant for plant 
biologists, but will also affect agricultural science, evolutionary
biology, bioinformatics, combinatorial chemistry, functional and 
comparative genomics, and molecular medicine. 

Overview of sequencing strategy 

We used large-insert bacterial artificial chromosome (BAC), phage
(P1) and transformation-competent artificial chromosome (TAC)
libraries 9-12 as the primary substrates for sequencing. Early stages of
genome sequencing used 79 cosmid clones. Physical maps of the
genome of accession Columbia were assembled by restriction
fragment 'fingerprint' analysis of BAC clones 13, by hybridization 14
or polymerase chain reaction (PCR) 15 of sequence-tagged sites and
by hybridization and Southern blotting 16. The resulting maps were
integrated (http://nucleus.cshl.org/arabmaps/) with the genetic
map and provided a foundation for assembling sets of contigs
into sequence-ready tiling paths. End sequence
(http://www.tigr.org/tdb/at/abe/bac_end_search.html) of 47,788 BAC clones
was used to extend contigs from BACs anchored by marker content
and to integrate contigs.

Ten contigs representing the chromosome arms and centromeric 
heterochromatin were assembled from 1,569 BAC, TAC, cosmid and
P1 clones (average insert size 100 kilobases (kb)). Twenty-two PCR
products were amplified directly from genomic DNA and 
sequenced to link regions not covered by cloned DNA or to optimize 
the minimal tiling path. Telomere sequence was obtained from 
specific yeast artificial chromosome (YAC) and phage clones, and 
from inverse polymerase chain reaction (IPCR) products derived 
from genomic DNA. Clone fingerprints, together with BAC end 
sequences, were generally adequate for selection of clones for 
sequencing over most of the genome. In the centromeric regions, 
these physical mapping methods were supplemented with genetic 
mapping to identify contig positions and orientation 17.

Selected clones were sequenced on both strands and assembled
using standard techniques. Comparison of independently derived
sequence of overlapping regions and independent reassembly of
sequenced clones revealed accuracy rates between 99.99 and
99.999%. Over half of the sequence differences were between
genomic and BAC clone sequence. All available sequenced genetic
markers were integrated into sequence assemblies to verify sequence
contigs 4-8. The total length of sequenced regions, which extend from
either the telomeres or ribosomal DNA repeats to the 180-base-pair
(bp) centromeric repeats, is 115,409,949 bp (Table 1). Estimates of
the unsequenced centromeric and rDNA repeat regions measure
roughly 10 megabases (Mb), yielding a genome size of about
125 Mb, in the range of the 50-150 Mb haploid content estimated
by different methods 18. In general, features such as gene density,
expression levels and repeat distribution are very consistent across
the five chromosomes (Fig. 1), and these are described in detail in
reports on individual chromosomes 4-8 and in the analysis of
centromere, telomere and rDNA sequences.
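
The genome-size estimate in this paragraph follows from simple addition, sketched below as an arithmetic check. The variable names are illustrative, and the unsequenced portion is taken as exactly 10 Mb, although the text gives it only as a rough figure.

```python
sequenced_bp = 115_409_949    # total length of sequenced regions (Table 1)
unsequenced_bp = 10_000_000   # "roughly 10 Mb" of centromeric and rDNA repeats (approximate)

genome_size_mb = (sequenced_bp + unsequenced_bp) / 1_000_000
print(round(genome_size_mb))  # 125
```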

We used tRNAscan-SE 1.21 (ref. 19) and manual inspection to
identify 589 cytoplasmic transfer RNAs, 27 organelle-derived
tRNAs and 13 pseudogenes, more than in any other genome
sequenced to date. All 46 tRNA families needed to decode all
possible 61 codons were found, defining the completeness of the
functional set. Several highly amplified families of tRNAs were
found on the same strand 6; excluding these, each amino acid is
decoded by 10-41 tRNAs.

The spliceosomal RNAs (U1, U2, U4, U5, U6) have all been
experimentally identified in Arabidopsis. The previously identified
sequences for all RNAs were found in the genome, except for U5
where the most similar counterpart was 92% identical. Between 10
and 16 copies of each small nuclear RNA (snRNA) were found
across all chromosomes, dispersed as singletons or in small groups.

The small nucleolar RNAs (snoRNAs) consist of two subfamilies,
the C/D box snoRNAs, which includes 36 Arabidopsis genes, and the
H/ACA box snoRNAs, for which no members have been identified
in Arabidopsis. U3 is the most numerous of the C/D box snoRNAs,
with eight copies found in the genome. We identified forty-five
additional C/D box snoRNAs using software
(www.rna.wustl.edu/snoRNAdb/) that detects snoRNAs that guide ribose methylation of
ribosomal RNA.

A combination of algorithms, all optimized with parameters
based on known Arabidopsis gene structures, was used to define
gene structure. We used similarities to known protein and expressed
sequence tag (EST) sequence to refine gene models. Eighty per cent
of the gene structures predicted by the three centres involved were
completely consistent, 93% of ESTs matched gene models, and less
than 1% of ESTs matched predicted non-coding regions, indicating
that most potential genes were identified. The sensitivity and
selectivity of the gene prediction software used in this report has
been comprehensively and independently assessed 20.

Figure 1 Representation of the Arabidopsis chromosomes. Each chromosome is
represented as a coloured bar. Sequenced portions are red, telomeric and centromeric
regions are light blue, heterochromatic knobs are shown black and the rDNA repeat
regions are magenta. The unsequenced telomeres 2N and 4N are depicted with dashed
lines. Telomeres are not drawn to scale. Images of DAPI-stained chromosomes were
kindly supplied by P. Fransz. The frequency of features was given pseudo-colour
assignments, from red (high density) to deep blue (low density). Gene density ('Genes')
ranged from 38 per 100 kb to 1 gene per 100 kb; expressed sequence tag matches
('ESTs') ranged from more than 200 per 100 kb to 1 per 100 kb. Transposable element
densities ('TEs') ranged from 33 per 100 kb to 1 per 100 kb. Mitochondrial and
chloroplast insertions ('MT/CP') were assigned black and green tick marks, respectively.
Transfer RNAs and small nucleolar RNAs ('RNAs') were assigned black and red tick
marks, respectively.

The 25,498 genes predicted (Table 1) is the largest gene set
published to date: C. elegans 1 has 19,099 genes and Drosophila 2
13,601 genes. Arabidopsis and C. elegans have similar gene density,
whereas Drosophila has a lower gene density. Arabidopsis also has a
significantly greater extent of tandem gene duplications and
segmental duplications, which may account for its larger gene set.

The rDNA repeat regions on chromosomes 2 and 4 were not 
sequenced because of their known repetitive structure and content. 
The centromeric regions are not completely sequenced owing to 
large blocks of monotonic repeats such as 5S rDNA and 180-bp 
repeats. The sequence continues to be extended further into 
centromeric and other regions of complex sequence. 



Characterization of the coding regions 

To assess the similarities and differences of the Arabidopsis gene
complement compared with other sequenced eukaryotic genomes,
we assigned functional categories to the complete set of Arabidopsis
genes. For chromosome 4 genes and the yeast genome, predicted
functions were previously manually assigned 5,21. All other predicted
proteins were automatically assigned to these functional
categories 13, assuming that conserved sequences reflect common
functional relationships.

The functions of 69% of the genes were classified according to
sequence similarity to proteins of known function in all organisms;
only 9% of the genes have been characterised experimentally
(Fig. 2a). Generally similar proportions of gene products were
predicted to be targeted to the secretory pathway and mitochondria
in Arabidopsis and yeast, and up to 14% of the gene products are
likely to be targeted to the chloroplast (Table 1). The significant
proportion of genes with predicted functions involved in metabolism,
gene regulation and defence is consistent with previous
analyses 5. Roughly 30% of the 25,498 predicted gene products
(Fig. 2a), comprising both plant-specific proteins and proteins with
similarity to genes of unknown function from other organisms,
could not be assigned to functional categories.

To compare the functional catagories in more detail, we com- 
pared data from the complete genomes of Escherichia coli 2 \ 
Synechocystis sp. M , Saccharomyces cerevisiae 11 , C degans 1 and 
£>ro$ophtta\ and a non-redundant protein set of Homo sapiens, 
with the Arabidopsis genome data (Fig. 2b), using a stringent 
BLASTF threshold value of E < 1Q~ 30 - The proportion of 
Arabidopsis proteins having related counterparts in eukaryotic 
genomes varies by a factor of 2 to 3 depending on the functional 
category. Only S-23% of Arabidopsis proteins involved in transcrip- 
tion have related genes in other eukaryotic genomes, reflecting the 
independent evolution of many plant transcription factors. In 
contrast, 48*60% of genes involved in protein synthesis have 
counterparts in the other eukaryotic genomes, reflecting highly 



b 



Figure 2. Functional analysis of Arabidopsis genes, a , Proportion of predicted Arabidopsis 
genes in different functional categories, b, Comparison of junctional categories between 
organisms, Subsets of Ihe Arabidopsis ptoteome containing all proteins that fail into a 
common functional class were assembled. Each subset was searched against the 
complete set of translations from EsehBtictia coil, Synochocystis sp. PCCB8»" 3 T 



conserved gene functions. The relatively high proportion of 
matches between Arabidopsis and bacterial proteins in the categories 
Metabolism' and "energy 1 reflects both the acquisition of bacterial 
genes from the ancestor of the plastid and high conservation of 
sequences across all species. Finally, a comparison between uni- 
cellular and multicellular eukaryotes indicates that Arabidopsis 
genes involved in cellular communication and signal transduction 
have more counterparts in multicellular eukaryotes than in yeast, 
reflecting the need for sets of genes for communication in multi- 
cellular organisms. 

Pronounced redundancy in the Arabidopsis genome is evident in 
Segmental duplications and tandem arrays, and many other genes 
with high levels of sequence conservation are also scattered over the 
genome. Sequence similarity exceeding a BLASTP value H< 10" 20 
and extending over at least 80% of the protein length were used as 
parameters to identify protein families (table 2), A total of 11,601 
protein types were identified. Thirty- five per cent of the predicted 
proteins are unique in the genome, a-ad the proportion of proteins 
belonging to families of more than five members is substantially 
higher in Atabidcpm (37,4%) than in Drosophila (12.1%) or 



Saccharomyces cerevisiae, Drosophila, C. elegans and a Homo sapiens non-redundant 
protein database. The percentage of Arabidopsis proteins in a particular subset that had a 
BLASTP match with E ≤ 10^-30 to the respective reference genome is shown. This reflects 
the measure of sequence conservation of proteins within this particular functional 
category between Arabidopsis and the respective reference genome. y axis, 0.1 = 10%. 

Table 2 Proportion of genes in different organisms present as either singletons or in paralogous families 

C. elegans (24.0%). The absolute number of Arabidopsis gene 
families and singletons (types) is in the same range as the other 
multicellular eukaryotes, indicating that a proteome of 11,000- 
15,000 types is sufficient for a wide diversity of multicellular life. 
The proportion of gene families with more than two members is 
considerably more pronounced in Arabidopsis than in other eukar- 
yotes (Fig. 3). As segmental duplication is responsible for 6,303 gene 
duplications (see below), the extent of tandem gene duplications 
accounts for a significant proportion of the increased family size. 
These features of the Arabidopsis, and presumably other plant, 
genomes may indicate more relaxed constraints on genome size 
in plants, or a more prominent role of unequal crossing over to 
generate new gene copies. 
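
The family criterion quoted above (a BLASTP E value below 10^-20 with the match extending over at least 80% of the protein length) can be illustrated as a single-linkage grouping of pairwise hits. This is only a sketch under stated assumptions, not the procedure the authors ran: the hit tuples, the choice to measure coverage against the longer protein, and the name `cluster_families` are all hypothetical.

```python
from collections import defaultdict

E_CUTOFF = 1e-20      # E-value threshold quoted in the text
MIN_COVERAGE = 0.80   # fraction of protein length the match must span


def cluster_families(proteins, hits):
    """Group proteins into families by single linkage over qualifying hits.

    proteins: dict mapping protein id -> length in residues
    hits: iterable of (query, subject, evalue, aligned_length) tuples
    """
    parent = {p: p for p in proteins}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, evalue, aligned in hits:
        if query == subject or evalue >= E_CUTOFF:
            continue
        # Assumption: coverage is judged against the longer of the two proteins.
        longest = max(proteins[query], proteins[subject])
        if aligned / longest >= MIN_COVERAGE:
            parent[find(query)] = find(subject)

    families = defaultdict(list)
    for p in proteins:
        families[find(p)].append(p)
    return sorted(sorted(members) for members in families.values())


# Toy data (hypothetical): A, B and C form one family, D stays a singleton.
proteins = {"A": 300, "B": 310, "C": 290, "D": 500}
hits = [
    ("A", "B", 1e-50, 280),   # passes both filters
    ("B", "C", 1e-25, 260),   # passes both filters
    ("C", "D", 1e-40, 200),   # fails coverage: 200/500 = 0.40
    ("A", "D", 1e-5, 450),    # fails the E-value cutoff
]
families = cluster_families(proteins, hits)
singletons = [f for f in families if len(f) == 1]
```

In this scheme each connected group, including groups of one, counts as a protein "type", which is how singletons and families can be tallied together.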

Conserved protein domains revealed more informative differ- 
ences through INTERPRO 25 analysis of the predicted gene products 
from Arabidopsis, S. cerevisiae, C. elegans and Drosophila. Statisti- 
cally over-represented domains, and those that are absent from the 
Arabidopsis genome, indicate domains that may have been gained or 
lost during the evolution of plants (Supplementary Information 
Table 1). Proteins containing the Pro-Pro-Arg repeat, which is 
involved in RNA stabilization and RNA processing, are over- 
represented as compared to yeast, fly and worm; 400 proteins 
containing this signature were detected in Arabidopsis compared 
with only 10 in total in yeast, Drosophila and C. elegans. Protein 
kinases and associated domains, 169 proteins containing a disease 
resistance protein signature, and the Toll/IL-1R (TIR) domain, a 
component of pathogen recognition molecules 26, are also relatively 
abundant. This suggests that pathways transducing signals in 
response to pathogens and diverse environmental cues are more 
abundant in plants than in other organisms. 

The RING zinc finger domain is relatively over-represented in 
Arabidopsis compared with yeast, Drosophila and C. elegans, whereas 
the F-box domain is over-represented as compared with yeast and 
Drosophila only. These domains are involved in targeting proteins to 
the proteasome 27 and ubiquitinylation 28 pathways of protein degra- 
dation, respectively. In plants many processes such as hormone and 
defence responses, light signalling, and circadian rhythms and 
pattern formation use F-box function to direct negative regulators 




[Figure 3 histogram: x axis, number of tandemly repeated genes per gene array] 
Figure 3 Distribution of tandemly repeated gene arrays in the Arabidopsis genome. 
Tandemly repeated gene arrays were identified using the BLASTP program with a 
threshold of E < 10^-20. One unrelated gene among cluster members was tolerated. The 
histogram gives the number of clusters in the genome containing 2 to n similar gene units 
in tandem. 
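
The rule in the Figure 3 legend (similar genes adjacent on the chromosome, with one unrelated gene tolerated among cluster members) can be sketched as a scan along the gene order. The code below is a hedged illustration only: the function name `tandem_arrays`, the `similar` callback standing in for a BLASTP comparison, and the reading of "one unrelated gene tolerated" as "at most one intervening gene between consecutive members" are assumptions.

```python
from collections import Counter


def tandem_arrays(gene_order, similar, max_gap=1):
    """Return runs of similar genes found along one chromosome.

    gene_order: list of gene ids in chromosomal order
    similar: function(a, b) -> bool, True if a and b pass the similarity cutoff
    max_gap: number of unrelated genes tolerated between consecutive members
    """
    arrays = []
    used = set()
    for i, gene in enumerate(gene_order):
        if gene in used:
            continue
        members = [gene]
        last = i          # index of the most recently added member
        j = i + 1
        # Keep scanning while the candidate is within max_gap intervening genes.
        while j < len(gene_order) and j - last <= max_gap + 1:
            cand = gene_order[j]
            if cand not in used and any(similar(cand, m) for m in members):
                members.append(cand)
                last = j
            j += 1
        if len(members) >= 2:
            arrays.append(members)
            used.update(members)
    return arrays


# Toy chromosome (hypothetical): family labels stand in for BLASTP results.
family = {"g1": "K", "g2": "K", "g3": "X", "g4": "K",
          "g5": "Y", "g6": "Z", "g7": "K"}
order = ["g1", "g2", "g3", "g4", "g5", "g6", "g7"]
arrays = tandem_arrays(order, lambda a, b: family[a] == family[b])
size_histogram = Counter(len(a) for a in arrays)
```

Here g4 joins the array despite the unrelated g3, while g7 is left out because two unrelated genes separate it from g4. The `size_histogram` counter corresponds to the kind of distribution the figure plots.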



to the ubiquitin degradation pathway. This mode of regulation 
appears to be more prevalent in plants and may account for a higher 
representation of the F box than in Drosophila and for the over- 
representation of the ubiquitin domain in the Arabidopsis genome. 
RING finger domain proteins in general have a role in ubiquitin 
protein ligases, indicating that proteasome-mediated degradation is 
a more widespread mode of regulation in plants than in other 
kingdoms. 

Most functions identified by protein domains are conserved in 
similar proportions in the Arabidopsis, S. cerevisiae, Drosophila and 
C. elegans genomes, pointing to many ubiquitous eukaryotic path- 
ways. These are illustrated by comparing the list of human disease 
genes 29 to the complete Arabidopsis gene set using BLASTP. Out 
of 289 human disease genes, 139 (48%) had hits in Arabidopsis 
using a BLASTP threshold E < 10^-10. Sixty-nine (24%) exceeded an 
E < 10^-40 threshold, and 26 (9%) had scores better than E < 10^-100 
(Table 3). There are at least 17 human disease genes more similar to 
Arabidopsis genes than yeast, Drosophila or C. elegans genes 
(Table 3). 
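
As a quick arithmetic check, the hit counts given above can be converted to percentages of the 289 human disease genes. The dictionary keys are just labels chosen for this sketch.

```python
TOTAL_DISEASE_GENES = 289

# Number of human disease genes with an Arabidopsis hit at each threshold.
hit_counts = {"E < 1e-10": 139, "E < 1e-40": 69, "E < 1e-100": 26}

# Percentage of the 289 genes, rounded to the nearest whole per cent.
hit_percent = {
    label: round(100 * count / TOTAL_DISEASE_GENES)
    for label, count in hit_counts.items()
}
```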

This analysis shows that, although numerous families of proteins 
are shared between all eukaryotes, plants contain roughly 150 
unique protein families. These include transcription factors, struc- 
tural proteins, enzymes and proteins of unknown function. Mem- 
bers of the families of genes common to all eukaryotes have 
undergone substantial increases or decreases in their size in 
Arabidopsis. Finally, the transfer of a relatively small number of 
cyanobacteria-related genes from a putative endosymbiotic ances- 
tor of the plastid has added to the diversity of protein structures 
found in plants. 

Genome organization and duplication 

The Arabidopsis genome sequence provides a complete view of 
chromosomal organization and clues to its evolutionary history. 
Gene families organized in tandem arrays of two or more units have 
been described in C. elegans 1 and Drosophila 2. Analysis of the 
Arabidopsis genome revealed 1,528 tandem arrays containing 
4,140 individual genes, with arrays ranging up to 23 adjacent 
members (Fig. 3). Thus 17% of all genes of Arabidopsis are arranged 
in tandem arrays. 

Large segmental duplications were identified either by directly 
aligning chromosomal sequences or by aligning proteins and 
searching for tracts of conserved gene order. All five chromosomes 
were aligned to each other in both orientations using MUMmer 30, 
and the results were filtered to identify all segments at least 1,000 bp 
in length with at least 50% identity (Supplementary Information 
Fig. 1). These revealed 24 large duplicated segments of 100 kb or 
larger, comprising 65.6 Mb or 58% of the genome. The only 
duplicated segment in the centromeric regions was a 375-kb 
segment on chromosome 4. Many duplications appear to have 
undergone further shuffling, such as local inversions after the 
duplication event. 
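
The filtering step described above (keep aligned segments of at least 1,000 bp with at least 50% identity) amounts to a simple predicate over alignment records. The sketch below assumes a hypothetical `Match` record; it does not reproduce MUMmer's output format or the authors' scripts.

```python
from dataclasses import dataclass

MIN_LENGTH_BP = 1000   # minimum segment length quoted in the text
MIN_IDENTITY = 0.50    # minimum fractional identity quoted in the text


@dataclass
class Match:
    """One aligned segment between two chromosomes (hypothetical record)."""
    chrom_a: int
    chrom_b: int
    length_bp: int
    identity: float  # fraction between 0 and 1


def passes_filter(match):
    """True if the segment is long enough and similar enough to keep."""
    return match.length_bp >= MIN_LENGTH_BP and match.identity >= MIN_IDENTITY


# Toy alignment records (made up for illustration).
matches = [
    Match(1, 2, 250_000, 0.71),   # kept
    Match(1, 3, 800, 0.95),       # too short
    Match(4, 5, 12_000, 0.42),    # identity too low
    Match(2, 3, 1_000, 0.50),     # kept: both thresholds are inclusive
]
kept = [m for m in matches if passes_filter(m)]
kept_bp = sum(m.length_bp for m in kept)
```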

We used TBLASTX 5 to identify collinear clusters of genes residing 
in large duplicated chromosomal segments. The duplicated regions 
encompass 67.9 Mb, 60% of the genome, slightly more than was 
found in the DNA-based alignment (Fig. 4), and these data extend 
earlier findings 1,5,31. The extent of sequence conservation of the 
duplicated genes varies greatly, with 6,303 (37%) of the 17,193 genes 
in the segments classified as highly conserved (E < 10^-30) and a 
further 1,705 (10%) showing less significant similarity up to 
E < 10^-5. The proportion of homologous genes in each duplicated 
segment also varies widely, between 20% and 47% for the highly 
conserved class of genes. In many cases, the number of copies of a 
gene and its counterpart differ (for example, one copy on one 
chromosome and multiple copies on the other; see Supplementary 
Information Fig. 2); this could be due to either tandem duplication 
or gene loss after the segmental duplication. 

What does the duplication in the Arabidopsis genome tell us 
about the ancestry of the species? Polyploidy occurs widely in plants 
and is proposed to be a key factor in plant evolution 32. As the 
majority of the Arabidopsis genome is represented in duplicated 
(but not triplicated) segments, it appears most likely that 
Arabidopsis, like maize, had a tetraploid ancestor 33. A comparative 
sequence analysis of Arabidopsis and tomato estimated that a 
duplication occurred ~112 Myr ago to form a tetraploid 34. The 
degrees of conservation of the duplicated segments might be due to 
divergence from an ancestral autotetraploid form, or might reflect 
differences present in an allotetraploid ancestor. It is also possible, 
however, that several independent segmental duplication events 
took place instead of tetraploid formation and stabilisation. 

The diploid genetics of Arabidopsis and the extensive divergence 
of the duplicated segments have masked its evolutionary history. 
The determination of Arabidopsis gene functions must therefore be 
pursued with the potential for functional redundancy taken into 
account. The long period of time over which genome stabilization 
has occurred has, however, provided ample opportunity for the 
divergence of the functions of genes that arose from duplications. 

Comparative analysis of Arabidopsis accessions 

Comparing the multiple accessions of Arabidopsis allows us to 
identify commonly occurring changes in genome microstructure. 
It also enables the development of new molecular markers for 
genetic mapping. High rates of polymorphism between 
Arabidopsis accessions, including both DNA sequence and copy 
number of tandem arrays, are prevalent at loci involved in disease 
resistance 35. This has been observed for other plant species, and such 
loci are thought to serve as templates for illegitimate recombination 
to create new pathogen response specificities 36. We carried out a 
comparative analysis between 82 Mb of the genome sequence of 
Arabidopsis accession Columbia (Col-0) and 92.1 Mb of non- 
redundant low-pass (twofold redundant) sequence data of the 
genomic DNA of accession Landsberg erecta (Ler). We identified 
two classes of differences between the sequences: single nucleotide 
polymorphisms (SNPs), and insertion-deletions (InDels). As we 
used high stringency criteria, our results represent a minimum 
estimate of numbers of polymorphisms between the two genomes. 

In total, we detected 25,274 SNPs, representing an average density 
of 1 SNP per 3.3 kb. Transitions (A/T-G/C) represented 52.1% of 
the SNPs, and transversions accounted for the remainder: 17.3% for 
A/T-T/A, 22.7% for A/T-C/G and 7.9% for C/G-G/C. In total, we 
detected 14,570 InDels at an average spacing of 6.1 kb. They ranged 
from 2 bp to over 38 kilobase-pairs, although 95% were smaller than 
50 bp. Only 10% of the InDels were co-located with simple sequence 
repeats identified with the program Sputnik. An analysis of 416 
relative insertions greater than 250 bp in Col-0 showed that 30% 
matched transposon-related proteins, indicating that a substantial 
proportion of the large InDels are the result of transposon insertion 
or excision. Many InDels contained entire active genes not related to 
transposons. Half of such genes absent from corresponding posi- 
tions in the Col-0 sequence were found elsewhere on the genome of 
Ler. This indicates that genes have been transferred to new genomic 
locations. 
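
The four SNP classes reported above are strand-symmetric: for example, an A-to-C change on one strand is a T-to-G change on the other, so both fall under A/T-C/G. A minimal sketch of that classification follows; the function name `snp_class` is an assumption for illustration, and the percentages are simply those quoted in the text.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
PURINES = {"A", "G"}


def snp_class(ref, alt):
    """Assign a SNP to one of the four strand-symmetric classes in the text."""
    if ref == alt or ref not in COMPLEMENT or alt not in COMPLEMENT:
        raise ValueError("expected two different bases from A, C, G, T")
    if (ref in PURINES) == (alt in PURINES):
        # Transition: purine <-> purine or pyrimidine <-> pyrimidine.
        return "A/T-G/C"
    if COMPLEMENT[ref] == alt:
        # Transversion between complementary bases.
        return "A/T-T/A" if ref in "AT" else "C/G-G/C"
    # Remaining transversions: A <-> C and G <-> T.
    return "A/T-C/G"


# Shares of the SNPs as reported in the text (per cent).
reported_percent = {
    "A/T-G/C": 52.1,
    "A/T-T/A": 17.3,
    "A/T-C/G": 22.7,
    "C/G-G/C": 7.9,
}
transversion_percent = sum(
    share for cls, share in reported_percent.items() if cls != "A/T-G/C"
)
```

The reported shares sum to 100%, with transversions making up the 47.9% not accounted for by transitions.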

Gene structures are often affected by small InDels and SNPs. The 
positions of SNPs and InDels were mapped relative to 87,427 exons 
and 70,379 introns annotated in the Col-0 sequence. SNPs were 
found in exons, introns and intergenic regions at frequencies of 1 
SNP per 3.1, 2.2 and 3.5 kb, respectively. The frequencies for InDels 
were 1 per 9.3, 3.1 and 4.3 kb, respectively. Polymorphisms were 
detected in 7% of exons, and alter the spliced sequences of 25% of 
the predicted genes. For InDels in exons, insertion lengths divisible 
by three are prevalent for small insertions (< 50 bp), indicating that 
many proteins can withstand small insertions or deletions of amino 
acids without loss of function. 
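
Why insertion lengths divisible by three are tolerated can be seen by comparing codons downstream of an insertion. The sketch below uses a made-up coding sequence; the helper names are hypothetical.

```python
def codons(seq):
    """Split a coding sequence into complete triplets, dropping any remainder."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def insert_at(seq, pos, inserted):
    """Return seq with `inserted` placed before position `pos` (0-based)."""
    return seq[:pos] + inserted + seq[pos:]


def preserves_frame(indel_length_bp):
    """True if an insertion or deletion of this length keeps the reading frame."""
    return indel_length_bp % 3 == 0


cds = "ATGGCTGAAACCTAA"                 # ATG GCT GAA ACC TAA (toy example)
in_frame = insert_at(cds, 6, "GGG")     # 3-bp insertion adds one codon
frameshift = insert_at(cds, 6, "G")     # 1-bp insertion shifts every later codon
```

With the 3-bp insertion, the codons after the insertion point are the same as in the original sequence; with the 1-bp insertion, every downstream codon changes.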

Our analyses show that sequence polymorphisms between acces- 
sions of Arabidopsis are common, and that they occur in both 
coding and non-coding regions. We found evidence for the reloca- 
tion of genes in the genome, and for changes in the complement of 
transposable elements. The data presented here are available at 
http://www.arabidopsis.org/cereon/. 

Figure 4 Segmentally duplicated regions in the Arabidopsis genome. Individual 
chromosomes are depicted as horizontal grey bars (with chromosome 1 at the top), 
centromeres are marked black. Coloured bands connect corresponding duplicated 
segments. Similarities between the rDNA repeats are excluded. Duplicated segments in 
reversed orientation are connected with twisted coloured bands. The scale is 
in megabases. 

Comparison of Arabidopsis and other plant genera 

Comparative genetic mapping can reveal extensive conservation of 
genome organization between closely related species 37,38. The com- 
parative analysis of plant genome microstructure reveals much 
about the evolution of plant genomes and provides unprecedented 
opportunities for crop improvement by establishing the detailed 
structures of, and relationships between, the genomes of crops and 
Arabidopsis. 

The lineages leading to Arabidopsis and Capsella rubella (shepherd's 
purse) diverged between 6.2 and 9.8 Myr ago, and the gene content 
and genome organization of C. rubella is very similar to that of 
Arabidopsis 39, including the large-scale duplications. Alignment of 
Arabidopsis complementary DNA and EST sequences with genomic 
DNA sequences of Arabidopsis and C. rubella showed conservation 
of exon length and intron positions. Coding sequences predicted 
from these alignments differed from the annotated Arabidopsis gene 
sequences in two out of five cases. 

The ancestral lineages of Arabidopsis and the Brassica (cabbage 
and mustard) genera diverged 12.2-19.2 Myr ago 40. Brassica genes 
show a high level of nucleotide conservation with their Arabidopsis 
orthologues, typically more than 85% in coding regions 41. The 
structure of Brassica genomes resembles that of Arabidopsis, but 
with extensive triplication and rearrangement 41, and extensive 
divergence of microstructure (Supplementary Information Fig. 3). 
The divergence between the genomes of Arabidopsis and Brassica 
oleracea is in striking contrast to that observed between Arabidopsis 
and C. rubella, although the time since divergence is only twofold 
greater. This accelerated rate of change in triplicated segments of the 
genome of B. oleracea indicates that polyploidy fosters rapid 
chromosomal evolution. 

The Arabidopsis and tomato lineages diverged roughly 150 Myr 
ago, and comparative sequence analysis of segments of their 
genomes has revealed complex relationships 34. Four regions of the 
Arabidopsis genome are related to each other and to one region in 
the tomato genome, suggesting that two rounds of duplication may 
have occurred in the Arabidopsis lineage. The extensive duplication 
described here supports the proposal that the more recent of these 
duplications, estimated to have occurred ~112 Myr ago, was the 
result of a polyploidization event. The lineages of Arabidopsis and 
rice diverged ~200 Myr ago 42. Three regions of the genome of 
Arabidopsis were related to each other and to one region in the rice 
genome, providing further evidence for multiple duplication 
events 43,44. 

The frequent occurrence of tandem gene duplications and the 
apparent deletion of single genes, or small groups of adjacent genes, 
from duplicated regions suggests that unequal crossing over may be 
a key mechanism affecting the evolution of plant genome micro- 
structure. However, the segmental inversions and gene transloca- 
tions in the genomes of both rice and B. oleracea that are not found 
in Arabidopsis indicate that additional mechanisms may be 
involved 10 . 

Integration of the three genomes in the plant cell 

The three genomes in the plant cell, those of the nucleus, the 
plastids (chloroplasts) and the mitochondria, differ markedly in 
gene number, organization and stability. Plastid genes are densely 
packed in an order highly conserved in all plants 45, whereas 
mitochondrial genes 46 are widely dispersed and subjected to exten- 
sive recombination. 

Organellar genomes are remnants of independent organisms — 
plastids are derived from the cyanobacterial lineage and mitochon- 
dria from the α-Proteobacteria. The remaining genes in plastids 
include those that encode subunits of the photosystem and the 
electron transport chain, whereas the genes in mitochondria encode 
essential subunits of the respiratory chain. Both organelles contain 
sets of specific membrane proteins that, together with housekeeping 
proteins, account for 61% of the genes in the chloroplast and 88% 
in the mitochondrion (Table 4). The balances are involved in 
transcription and translation. 

The number of proteins encoded in the nucleus likely to be found 

Darier-White, SERCA 
Xeroderma Pigmentosum, D-XPD 
Xeroderma pigment, B-ERCC3 
Hyperinsulinism, ABCC8 
Renal tubul. acidosis, ATP6B1 
HDL deficiency 1, ABCA1 
Wilson, ATP7B 

Immunodeficiency, DNA ligase 1 
Stargardt's, ABCA4 
Ataxia telangiectasia, ATM 
Niemann-Pick, NPC1 
Menkes, ATP7A 
HNPCC*, MLH1 
Deafness, hereditary, MYO15 
Fam. cardiac myopathy, MYH7 
Xeroderma Pigmentosum, F-XPF 
G6PD deficiency, G6PD 
Cystic fibrosis, ABCC7 
Glycerol kinase defic., GK 
HNPCC, MSH3 
HNPCC, PMS2 
Zellweger, PEX1 
HNPCC, MSH6 
Bloom, BLM 

Finnish amyloidosis, GSN 
Chediak-Higashi, CHS1 
Xeroderma Pigmentosum, G-XPG 
Bare lymphocyte, ABCB3 
Citrullinemia, type I, ASS 
Coffin-Lowry, 
Keratoderma, KRT9 
Myotonic dystrophy, DM1 
Bartter's, SLC12A1 
Dent's, CLCN5 
Diaphanous 1, DAPH1 
AKT2 
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1.9X10- 73 
6.9* 10" 7a 



T27I1J6 
F16K9_19 
ATSg4l360 
F20D2a_11 
AT4g33510 
At2g417O0 
AT5Q44790 
T6D22J0 
At2g417O0 
AT3g4B190 

f7?22_1 
F2K11_17 
AT4g09140 
At2g31900 
Tl 011^1 4 
AT50411SO 
AT5g40760 
AT3982700 
T21F11J21 
AT4Q25540 
AT4g024ea 
AT5Q08470 
AT4g(?3070 
T19D16_15 
ATSgS?3£0 

F1QQ3.J1 
AT3g2S030 
AT5Q3904Q 
AT4S24830 
AT3gQ8720 
AT3Q17050 

At2g20470 

F£6Q16_0 
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Putative calcium ATPase 
Putative DNA repair protein 
DNA excision repair cross-complementing protein 
Multidrug resistance protein 
Probable H+-transporting ATPase 

Putative ABC transporter 
ATP-dependent copper transporter 
DNA ligase 
Putative ABC transporter 1 
Ataxia telangiectasia mutated protein AtATM 
Niemann-Pick C disease protein-like protein 
ATP-dependent copper transporter, putative 
MLH1 protein 
Putative unconventional myosin 
Putative myosin heavy chain 
Repair endonuclease (gb|AAF01274.1) 
Glucose-6-phosphate dehydrogenase 
ABC transporter-like protein 
Putative glycerol kinase 
Putative DNA mismatch repair protein 
No title 
Putative protein 
G/T DNA mismatch repair enzyme 
DNA helicase isolog 
Villin 

Putative transport protein 
Hypothetical protein 
ABC transporter-like protein 
Argininosuccinate synthase-like protein 
Putative ribosomal-protein S6 kinase (ATPK19) 
Unknown protein 
Putative protein kinase 
Cation-chloride co-transporter, putative 
CLC-d chloride channel protein 
Hypothetical protein 
Putative ribosomal-protein S6 kinase (ATPK6) 

in organelles was predicted using default settings on TargetP 
(Table 1). Many nuclear gene products that are targeted to either 
(or both) organelles were originally encoded in the organelle 
genomes and were transferred to the nuclear genome during 
evolutionary history. A large number also appear to be of eukaryotic 
origin, with functions such as protein import components, which 
were probably not required by the free-living ancestors of the 
endosymbionts. 

To identify nuclear genes of possible organellar ancestry, we 
compared all predicted Arabidopsis proteins to all proteins from 
completed genomes including those from plastids and mitochon- 
dria (Supplementary Information Table 2). This search identified 
proteins encoded by the Arabidopsis nuclear genome that are most 
similar to proteins encoded by other species' organelle genomes (14 
mitochondrial and 44 plastid). These represent organelle-to- 
nuclear gene transfers that have occurred sometime after the 
divergence of the organelle-containing lineages 47. There is a great 
excess of nuclear encoded proteins most similar to proteins from the 
cyanobacterium Synechocystis (Supplementary Information Fig. 4; 
806 Arabidopsis predicted proteins matching 404 different Synecho- 
cystis proteins, providing further evidence of a genome duplica- 
tion). These 806 Arabidopsis predicted proteins, and many others of 
greatly diverse function, are possibly of plastid descent. Through 
searches against proteins from other cyanobacteria (with incom- 
pletely sequenced genomes), we identified 69 additional genes of 
possibly plastid descent. Only 25% of these putatively plastid- 
derived proteins displayed a target peptide predicted by TargetP, 
indicating potential cytoplasmic functions for most of these genes. 

The difference between predicted plastid-targeted and predicted 
plastid-derived genes indicates that there is a probable overestima- 
tion by ab initio targeting prediction methods and a lack of 
resolution with respect to destination organelles, the possible 
extensive divergence of some endosymbiont-derived genes in the 
nuclear genome, the co-opting of nuclear genes for targeting to 
organelles, and cytoplasmic functions for cyanobacteria-derived 
proteins. Clearly, more refined tools and extensive experimentation 
are required to catalogue plastid proteins. 

The transfer of genes between genomes still continues (Supple- 
mentary Information Table 3). Plastid DNA insertions in the 
nucleus (17 insertions totalling 11 kb) contain full-length genes 
encoding proteins or tRNAs, fragments of genes and an intron as 
well as intergenic regions. Subsequent reshuffling in the nucleus is 
illustrated by the atpH gene, which was originally transferred 
completely, but is now in two pieces separated by 2 kb. The 13 
small mitochondrial DNA insertions total 7 kb in addition to the 
large insertion close to the centromere of chromosome 2 (ref. 3). 
The high level of recombination in the mitochondrial genome may 
account for these events. 

Transposable elements 

Transposons, which were originally identified in maize by Barbara 
McClintock, have been found in all eukaryotes and prokaryotes. A 
subset of transposons replicate through an RNA intermediate (class 
I), whereas others move directly through a DNA form (class II). 
Transposons are further classified by similarity either between their 
mobility genes or between their terminal and/or internal motifs, as 
well as by the size and sequence of their target site. Internally deleted 
elements can often be mobilised in trans by fully functional 
elements. 

Transposons in Arabidopsis account for at least 10% of the 
genome, or about one-fifth of the intergenic DNA. The 
Arabidopsis genome has a wealth of class I (2,109) and II 
(2,203) elements, including several new groups (1,209 elements; 
Supplementary Information Table 4). Mobile histories for many 
elements were obtained by identifying regions of the genome with 
significant similarity to 'empty' target sites (RESites), thus providing 
high-resolution information concerning the termini and target site 
duplications 43,45. These regions were readily detected because of the 
propensity of transposons to integrate into repeats and because of 
duplications in the genome sequence. In several cases, genes appear 
to have been included as 'passengers' in transposable units. In 
some cases, shared sequence similarity, coding capacity and RESites 
attest to recent activity of transposable elements in the Arabidopsis 
genome. Only about 4% of the complete elements identified 
correspond to an EST, however, suggesting that most are not 
transcribed. 

Transposable elements found in many other plant genomes are 
well represented in Arabidopsis, including copia- and gypsy-like long 
terminal repeat (LTR) retrotransposons, long interspersed nuclear 
elements (LINEs), short interspersed nuclear elements (SINEs), 
hobo/Activator/Tam3 (hAT)-like elements, CACTA-like elements 
and miniature inverted-repeat transposable elements (MITEs). 
Although usually small in size, some larger Tourist-like MITEs 
contain open reading frames (ORFs) with similarity to the trans- 
posases of bacterial insertion sequences 46. Basho and many Mutator- 
like elements (MULEs), first discovered in the Arabidopsis sequence, 
represent structurally unique transposons 48-50. Basho elements have 
a target site preference for mononucleotide 'A' and wide distribution 
among plants 49,51. MULEs exhibit a high level of sequence diversity 
and members of most groups lack long terminal inverted repeats 
(TIRs). Phylogenetic analysis of the Arabidopsis MURA-like trans- 
posases suggests that TIR-containing MULEs are more closely 
related to one another than to MULEs lacking TIRs 49,52. 

For many plants with large genomes, class I retrotransposons 
contribute most of the nucleotide content. In the small Arabidopsis 
genome, class I elements are less abundant and primarily occupy the 
centromere. In contrast, Basho elements and class II transposons 
such as MITEs and MULEs predominate on the periphery of 
pericentromeric domains (Fig. 5). In class II transposons, MULEs 
and CACTA elements are clustered near centromeres and hetero- 
chromatic knobs, whereas MITEs and hAT elements have a less 
pronounced bias. The distribution pattern of transposable elements 
observed in Arabidopsis may reflect different types of pericentro- 
meric heterochromatin regions and may be similar to those found 
in animals. 

Numerous centromeric satellite repeats are located between 
each chromosome arm and have not yet been sequenced, but 
are represented in part by unanchored BAC contigs (R. Martienssen 
and M. Marra, unpublished data). End sequence suggests that these 
domains contain many more class I than class II elements, con- 
sistent with the distribution reported here (K. Lemcke and R. 
Martienssen, unpublished data). We do not know the significance 
of the apparent paucity of elements in telomeric regions and in the 
region flanking the rDNA repeats on chromosome 4 (but not on 
chromosome 2). 

Overall, transposon-rich regions are relatively gene-poor and 
have lower rates of recombination and EST matches, indicating a 
correlation between low gene expression, high transposon density 
and low recombination 5. The role of transposons in genome 
organization and chromosome structure can now be addressed in a 
model organism known to undergo DNA methylation and other 
forms of chromatin modification thought to regulate 
transposition. 

rDNA, telomeres and centromeres 

Nucleolar organizers (NORs) contain arrays of unit repeats encod- 
ing the 18S, 5.8S and 25S ribosomal RNA genes and are transcribed 
by RNA polymerase I. Together with 5S RNA, which is transcribed 
by RNA polymerase III, these rRNAs form the structural and 
catalytic cores of cytoplasmic ribosomes. In Arabidopsis, the 
NORs juxtapose the telomeres of chromosomes 2 and 4, and 
comprise uninterrupted 18S, 5.8S and 25S units all orientated on 
the chromosomes in the same direction 54. In contrast, the 5S rRNA 
genes are localized to heterogeneous arrays in the centromeric 
regions of chromosomes 3, 4 and 5 (ref. 55; and Fig. 6). Both 
NORs are roughly 3.5-4.0 megabase-pairs and comprise ~350-400 
highly methylated rRNA gene units, each ~10 kb (ref. 54). The 
sequence between the euchromatic arms and NORs has been 
determined. Elsewhere in the genome, only one other 18S, 5.8S, 
25S rRNA gene unit was identified in centromere 3. Although minor 
variations in sequence length and composition occur in the NOR 
repeats, these variants are highly clustered, supporting a model of 
sequence maintenance through concerted evolution 55. 

Arabidopsis telomeres are composed of CCCTAAA repeats and 
average ~2-3 kb (ref. 56). For TEL4N (telomere 4 North), con- 
sensus repeats are adjacent to the NOR; the remaining telomeres are 
typically separated from coding sequences by repetitive subtelo- 
meric regions measuring less than 4 kb. Imperfect telomere-like 
arrays of up to 24 kb are found elsewhere in the genome, particularly 
near centromeres. These arrays might affect the expression of nearby 
genes and may have resulted from ancient rearrangements, such as 
inversions of the chromosome arms. 
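
A perfect tandem run of the CCCTAAA telomere unit mentioned above can be located with a regular expression. The following is a minimal sketch on a made-up sequence; the function name `longest_repeat_run` and its parameters are hypothetical, and it makes no attempt to detect the imperfect telomere-like arrays the text also describes.

```python
import re

TELOMERE_UNIT = "CCCTAAA"  # repeat unit quoted in the text


def longest_repeat_run(seq, unit=TELOMERE_UNIT, min_copies=2):
    """Return (start, copies) for the longest perfect tandem run of `unit`.

    Returns None if no run of at least `min_copies` copies is present.
    """
    pattern = re.compile("(?:%s){%d,}" % (re.escape(unit), min_copies))
    best = None
    for m in pattern.finditer(seq.upper()):
        copies = (m.end() - m.start()) // len(unit)
        if best is None or copies > best[1]:
            best = (m.start(), copies)
    return best


# Toy sequence: 7 bp of flank, a 5-copy run, a 2-bp spacer, then a 2-copy run.
toy = "GATTACA" + TELOMERE_UNIT * 5 + "GG" + TELOMERE_UNIT * 2
run = longest_repeat_run(toy)
```

For scale, a 2-3 kb telomere built from a 7-bp unit corresponds to roughly 285 to 430 copies.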

Centromere DNA mediates chromosome attachment to the 
meiotic and mitotic spindles and often forms dense heterochroma- 
tin. Genetic mapping of the regions that confer centromere function 
provided the markers necessary to precisely place BAC clones at 
individual centromeres 17; 69 clones were targeted for sequencing, 
resulting in over 5 Mb of DNA sequence from the centromeric 
regions. The unsequenced regions of centromeres are composed 
primarily of long, homogeneous arrays that were characterized 
previously with physical 57 and genetic mapping 17 and contain over 
3 Mb of repetitive arrays, including the 180-bp repeats and 5S 
rDNA 51 (Fig. 6). 

Arabidopsis centromeres, like those of many higher eukaryotes, 
contain numerous repetitive elements including retroelements, 
transposons, microsatellites and middle repetitive DNA 17. These 
repeats are rare in the euchromatic arms and often most abundant 
in pericentromeric DNA. The repeats, affinity for DNA-binding 
dyes, dense methylation patterns and inhibition of homologous 
recombination indicate that the centromeric regions are highly 
heterochromatic, and such regions are generally viewed as very 
poor environments for gene expression. Unexpectedly, we found at 
least 47 expressed genes encoded in the genetically defined centro- 
meres of Arabidopsis (http://preuss.bsd.uchicago.edu/arabidopsis.genome.html). 
In several cases, these genes reside on islands of 
unique sequence flanked by repetitive arrays, such as 180-bp or 5S 
rDNA repeats. Among the genes encoded in the centromeres are 
members of 11 of the 16 functional categories that comprise the 
proteome. The centromeres are not subject to recombination; 
consequently, genes residing in these regions probably exhibit 
unique patterns of molecular evolution. 

The function of higher eukaryotic centromeres may be specified 
by proteins that bind to centromere DNA, by epigenetic 
modifications, or by secondary or higher order structures. A 
pairwise comparison of the non-repetitive portions of all five 
centromeres showed they share limited (1-7%) sequence similarity. 
Forty-one families of small, conserved centromere sequences 
(AtCCS; see http://preuss.bsd.uchicago.edu/arabidopsis.genome.html) 
are enriched in the centromeric and pericentromeric regions 
and differ from sequences found in the centromeres of other 
eukaryotes. Molecular and genetic assays will be required to 
determine whether these conserved motifs nucleate Arabidopsis 
centromere activity. Apart from the AtCCS sequences, most cen- 
tromere DNA is not shared between chromosomes, complicating 
efforts to derive clear evolutionary relationships. In contrast, genetic 
and cytological assays indicate that homologous centromeres are 
highly conserved among Arabidopsis accessions, albeit subject to 
rearrangements such as inversions to form knobs 58,59 and 
insertions 4. Further investigation of centromere DNA promises to 
yield information on the evolutionary forces that act in regions of 
limited recombination, as well as an improved understanding of the 
role of DNA sequence patterns in chromosome segregation. 

Membrane transport 

Transporters in the plasma and intracellular membranes of
Arabidopsis are responsible for the acquisition, redistribution and
compartmentalization of organic nutrients and inorganic ions, as
well as for the efflux of toxic compounds and metabolic end
products, energy and signal transduction, and turgor generation.
Previous genomic analyses of membrane transport systems in
S. cerevisiae and C. elegans led to the identification of over 100
distinct families of membrane transporters 60,61. We compared
membrane transport processes between Arabidopsis, animals,
fungi and prokaryotes, and identified over 600 predicted membrane
transport systems in Arabidopsis
(http://www-biology.ucsd.edu/~ipaulsen/transport/), a similar
number to that of C. elegans (~700 transporters) and over twofold
greater than either S. cerevisiae or E. coli (~300 transporters).
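
The fold comparisons quoted above are ratios of approximate transporter counts. The following is a minimal sketch of that arithmetic, assuming the rounded figures given in the text ("over 600" is treated as 600, so the Arabidopsis ratios are lower bounds). The dictionary and the helper name `fold_difference` are illustrative and not part of the original analysis.

```python
# Approximate predicted transporter counts quoted in the text.
# These are rounded figures, not exact tallies; "over 600" is taken as 600.
transporter_counts = {
    "Arabidopsis": 600,
    "C. elegans": 700,
    "S. cerevisiae": 300,
    "E. coli": 300,
}


def fold_difference(counts, reference, other):
    """Return how many times larger the reference count is than the other."""
    return counts[reference] / counts[other]


for species in ("C. elegans", "S. cerevisiae", "E. coli"):
    ratio = fold_difference(transporter_counts, "Arabidopsis", species)
    print(f"Arabidopsis vs {species}: {ratio:.2f}-fold")
```

With these rounded inputs the Arabidopsis complement is about the same size as that of C. elegans and at least twice that of S. cerevisiae or E. coli, which matches the wording of the paragraph.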

We compared the transporter complement of Arabidopsis,
C. elegans and S. cerevisiae in terms of energy coupling mechanisms
(Fig. 7a). Unlike animals, which use a sodium ion P-type ATPase
pump to generate an electrochemical gradient across the plasma
membrane, plants and fungi use a proton P-type ATPase pump to
form a large membrane potential (-250 mV) 62. Consequently, plant
secondary transporters are typically coupled to protons rather than
to sodium 61. Compared with C. elegans, Arabidopsis has a
surprisingly high percentage of primary ATP-dependent transporters
(12% and 21% of transporters, respectively), reflecting increased
numbers of P-type ATPases involved in metal ion transport and ABC
ATPases proposed to be involved in sequestering unusual metabolites
and drugs in the vacuole or in other intracellular compartments.
These processes may be necessary for pathogen defence and nutrient
storage.

About 15% of the transporters in Arabidopsis are channel
proteins, five times more than in any single-celled organism but half
the number in C. elegans (Fig. 7b). Almost half of the Arabidopsis
channel proteins are aquaporins, and Arabidopsis has 10-fold more
major intrinsic protein (MIP) family water channels than
any other sequenced organism. This abundance emphasizes the
importance of hydraulics in a wide range of plant processes,
including sugar and nutrient transport into and out of the
vasculature, opening of stomatal apertures, cell elongation and
epinastic movements of leaves and stems. Although Arabidopsis has a
diverse range of metal cation transporters, C. elegans has more, many
of which function in cell-cell signalling and nerve signal transduction.
Arabidopsis also possesses transporters for inorganic anions such as
phosphate, sulphate, nitrate and chloride, as well as for metal cation
channels that serve in signal transduction or cell homeostasis.
Compared with other sequenced organisms, Arabidopsis has 10-fold
more predicted peptide transporters, primarily of the
proton-dependent oligopeptide transport (POT) family, emphasizing the
importance of peptide transport or indicating that there is broader
substrate specificity than previously realized. There are nearly 1,000
Arabidopsis genes encoding Ser/Thr protein kinases, suggesting that
peptides may have an important role in plant signalling.

Virtually no transporters for carboxylates, such as lactate and
pyruvate, were identified in the Arabidopsis genome. About 12% of
the transporters were predicted to be sugar transporters, mostly
consisting of paralogues of the MFS family of hexose transporters.
Notably, S. cerevisiae, C. elegans and most prokaryotes use
APC family transporters as their principal means of amino-acid
transport, but Arabidopsis appears to rely primarily on the AAAP
family of amino-acid and auxin transporters. More than 10% of the
transporters in Arabidopsis are homologous to drug efflux pumps;
these probably represent transporters involved in the sequestration
into vacuoles of xenobiotics, secondary metabolites, and breakdown
products of chlorophyll.

Surprisingly, Arabidopsis has close homologues of the human
ABC TAP transporters of antigenic peptides for presentation to the
major histocompatibility complex (MHC). In Arabidopsis, these
transporters may be involved in peptide efflux, or more
speculatively, in some form of cell-recognition response. Arabidopsis
also has 10-fold more members of the multi-drug and toxin extrusion
(MATE) family than any other sequenced organism; in bacteria,
these transporters function as drug efflux pumps. Curiously,
Arabidopsis has several homologues of the Drosophila RND
transporter family Patched protein, which functions in segment
polarity, and more than ten homologues of the Drosophila ABC family
eye pigment transporters. In plants, these are presumably involved in
intracellular sequestration of secondary metabolites.

DNA repair and recombination 

DNA repair and recombination pathways have many functions in
different species such as maintaining genomic integrity, regulating
mutation rates, chromosome segregation and recombination,
genetic exchange within and between populations, and immune
system development. Comparing the Arabidopsis genome with
other species 63 indicates that Arabidopsis has a similar set of DNA
repair and recombination (RAR) genes to most other eukaryotes.
The pathways represented include photoreactivation, DNA ligation,
non-homologous end joining, base excision repair, mismatch
excision repair, nucleotide excision repair and many aspects of
DNA recombination (Supplementary Information Table 5). The
Arabidopsis RAR genes include homologues of many DNA repair
genes that are defective in different human diseases (for example,
hereditary breast cancer and non-polyposis colon cancer,
xeroderma pigmentosum and Cockayne's syndrome).

One feature that sets Arabidopsis apart from other eukaryotes is
the presence of additional homologues of many RAR genes. This is
seen for almost every major class of DNA repair, including
recombination (four RecA), DNA ligation (four DNA ligase I),
photoreactivation (one class II photolyase and five class I photolyase
homologues) and nucleotide excision repair (six RPA1, two RPA2,
two Rad25, three TFB1 and four Rad23). This is most striking for
genes with probable roles in base excision repair. Arabidopsis
encodes 16 homologues of DNA base glycosylases (enzymes that
recognize abnormal DNA bases and cleave them from the
sugar-phosphate backbone), more than any other species known. This
includes several homologues of each of three families of alkylation
damage base glycosylases: two of the S. cerevisiae MPG; six of the
E. coli TagI; and two of the E. coli AlkA. Arabidopsis also encodes three
homologues of the apurinic-apyrimidinic (AP) endonuclease Xth.
AP endonucleases continue the base excision repair started by
glycosylases by cleaving the DNA backbone at abasic sites.

Evolutionary analysis indicates that some of the extra copies of 
RAR genes in Arabidopsis originated through relatively recent gene 
duplications — because many of the sets of genes are more closely 
related to each other than to their homologues in any other species. 
As duplication is frequently accompanied by functional divergence, 
the duplicate (paralogous) genes may have different repair
specificities or may have evolved functions that are outside RAR
functions (as is the case for two of the five class I photolyase
homologues, which function as blue-light receptors). In most cases, it
is not known whether the paralogous gene copies have different
functions. The presence of multiple paralogues might also allow
functional redundancy or a greater repair or recombination capacity.

The multiplicity of RAR genes in Arabidopsis is also partly due to
the transfer of genes from the organellar genomes to the nucleus.
Repair gene homologues that appear to be of chloroplast origin
(Supplementary Information Tables 2 and 5) include the
recombination proteins RecA, RecG and SMS, two class I photolyase
homologues, Fpg, two MutS2 proteins, and the
transcription-repair coupling factor Mfd. Two of these (RecA and Fpg)
are involved in RAR functions in the plastid, suggesting that the
others may be as well. The finding of an Mfd orthologue of
cyanobacterial descent is surprising. In E. coli, Mfd couples
nucleotide excision repair carried out by UvrABC to transcription,
leading to the rapid repair of DNA damage on the transcribed strand of
transcribed genes 6 *. The absence of orthologues of UvrABC in
Arabidopsis renders the function of Mfd difficult to predict. The
presence of Mfd but not UvrABC has been reported for only one
other species, a bacterial endosymbiont of the pea aphid.

Other nuclear-encoded Arabidopsis DNA repair gene homologues
are evolutionarily related to genes from α-Proteobacteria, and
thus may be of mitochondrial descent. In particular, the six
homologues of the alkyl-base glycosylase TagI appear to be the result
of a large expansion in plants after transfer from the mitochondrial
genome. Whether any of these TagI homologues function in the
repair and maintenance of mitochondrial DNA has not been
determined. More detailed phylogenetic analysis may reveal
additional Arabidopsis RAR genes to be of organellar ancestry.

There are some notable absences of proteins important for RAR
in other species, including alkyltransferases, MSH4, RPA3 and many
components of TFIIH (TFB2, TFB3, TFB4, CCL1, Kin28).
Nevertheless, Arabidopsis shows many similarities to the set of DNA
repair genes found in other eukaryotes, and therefore offers an
experimental system for determining the functions of many of these
proteins, in part through characterization of mutants defective in
DNA repair* 7.

Gene regulation

Eukaryotic gene expression involves many nuclear proteins that
modulate chromatin structure, contribute to the basal transcription
machinery, or mediate gene regulation in response to
developmental, environmental or metabolic cues. As predicted by
sequence similarity, more than 3,000 such proteins may be encoded by
the Arabidopsis genome, suggesting that it has a comparable complexity
of gene regulation to other eukaryotes. Arabidopsis has an additional
level of gene regulation, however, with DNA methylation potentially
mediating gene silencing and parental imprinting.

Plants have evolved several variations on chromatin remodelling
proteins, such as the family of HD2 histone deacetylases. Although
Arabidopsis possesses the usual number of SNF2-type chromatin
remodelling ATPases, which regulate the expression of nearly all
genes, there are significant structural differences between yeast and
metazoan SNF2-type genes and their orthologues in Arabidopsis.
DDM1, a member of the SNF2 superfamily, and MOM1, a gene with
similarity to the SNF2 family, are involved in transcriptional gene
silencing in Arabidopsis. MOM1 has no clear orthologue in fungal or
metazoan genomes.

Consistent with its methylated DNA, Arabidopsis possesses
eight DNA methyltransferases (DMTs). Two of the three types
are orthologous to mammalian DMTs 69, whereas one,
chromomethyltransferase 70, is unique to plants. No DMTs are found in
yeast or C. elegans, although two DMT-like genes are found in
Drosophila 71. Arabidopsis also encodes eight proteins with
methyl-DNA-binding domains (MBDs). Despite lacking methylated DNA,
Drosophila encodes four MBD proteins and C. elegans has two.
These differences in chromatin components are likely to
reflect important differences in chromatin-based regulatory
control of gene expression in eukaryotes (Supplementary
Information Table 6; http://Ag.Arizona.Edu/chromatin/chromatin.html).

The Arabidopsis genome encodes transcription machinery for the
three nuclear DNA-dependent RNA polymerase systems typical of
eukaryotes (Supplementary Information Table 6). Transcription by
RNA polymerases II and III appears to involve the same machinery
as is used in other eukaryotes; however, most transcription factors for
RNA polymerase I are not readily identified. Only two polymerase I
regulators (other than polymerase subunits and TATA-binding
protein) are apparent in Arabidopsis, namely homologues of yeast
RRN3 and mouse TTF-1. All eukaryotes examined to date have
distinct genes for the largest and second largest subunits of
polymerase I, II and III. Unexpectedly, Arabidopsis has two genes
encoding a fourth class of largest subunit and second-largest
subunit (Supplementary Information Fig. 5). It will be interesting
to determine whether the atypical subunits comprise a polymerase
that has a plant-specific function. Four genes encoding
single-subunit plastid or mitochondrial RNA polymerases have been
identified in Arabidopsis (Supplementary Information Table 6).
Genes for the bacterial β-, β'- and α-subunits of RNA polymerase
are also present, as are homologues of various σ-factors, and these
proteins may regulate chloroplast gene expression. Mutations in the
Sde-1 gene, encoding RNA-dependent RNA polymerase (RdRp),
lead to defective post-transcriptional gene silencing 71. We also
identified five more closely related RdRp genes.

Our analysis, using both similarity searches and domain matches,
has identified 1,709 proteins with significant similarity to known
classes of plant transcription factors classified by conserved
DNA-binding domains. This analysis used a consistent conservative
threshold that probably underestimates the size of families of
diverse sequence. This class of protein is the least conserved
among all classes of known proteins, showing only 8-23% similarity
to transcription factors in other eukaryotes (Fig. 2b). This
reduced similarity is due to the absence of certain classes of
transcription factors in Arabidopsis and large numbers of
plant-specific transcription factors. We did not detect any members of
several widespread families of transcription factors, such as the REL
(Rel-like DNA-binding domain) homology region proteins, nuclear
steroid receptors and forkhead-winged helix and POU (Pit-1, Oct-
and Unc-86) domain families of developmental regulators.
Conversely, of 29 classes of Arabidopsis transcription factors, 16 appear
to be unique to plants (Supplementary Information Table 6).
Several of these, such as the AP2/EREBP-RAV, NAC and
ARF-AUX/IAA families, contain unique DNA-binding domains, whereas
others contain plant-specific variants of more widespread domains,
such as the DOF and WRKY zinc-finger families and the two-repeat
MYB family.

Functional redundancy among members of large families of
closely related transcription factors in Arabidopsis is a significant
potential barrier to their characterization 73. For example, in the
SHATTERPROOF and SEPALLATA families of MADS box
transcription factors, all genes must be defective to produce visible
mutant phenotypes 74,75. These functionally redundant genes are
found on the segmental duplications described above. Our analyses,
together with the significant sequence similarity found in large
families of transcription factors such as the R2R3-repeat MYB and
WRKY families, suggest that strategies involving overexpression will
be important in determining the functions of members of
transcription factor families.

Arabidopsis has two or over three times more transcription factors
than identified in Drosophila or C. elegans, respectively. The
significantly greater extent of segmental chromosomal and local tandem
duplications in the Arabidopsis genome generates larger gene families,
including transcription factors. The partly overlapping functions
defined for a few transcription factors are also likely to be much
more widespread, implicating many sequence-related transcription
factors in the same cellular processes. Finally, the expanded number
of genes involved in metabolism, defence and environmental
interaction in Arabidopsis (Fig. 2a), which have few counterparts in
Drosophila and C. elegans, all require additional numbers and classes
of transcription factors to integrate gene function in response to a
vast range of developmental and environmental cues.

Cellular organization 

Plant cells differ from animal cells in many features such as plastids,
vacuoles, Golgi organization, cytoskeletal arrays, plasmodesmata
linking cytoplasms of neighbouring cells, and a rigid
polysaccharide-rich extracellular matrix, the cell wall. Because the
cell wall maintains the position of a cell relative to its neighbours,
both changes in cell shape and organized cell divisions, involving
cytoskeleton reorganization and membrane vesicle targeting, have major
roles in plant development. Plant cytokinesis is also unique in
that the partitioning membrane is formed de novo by vesicle fusion.
We compared the Arabidopsis genome with those of C. elegans,
Drosophila and yeast to glimpse the genetic basis of
plant-cell-specific features.

The principal components of the plant cytoskeleton are
microtubules (MTs) and actin filaments (AFs); intermediate filaments
(IFs) have not been described in plants. Arabidopsis appears to lack
genes for cytokeratin or vimentin, the main components of animal
IFs, but has several variants of actin, α- and β-tubulin. The
Arabidopsis genome also encodes homologues of chaperones that
mediate the folding of tubulin and actin polypeptides in yeast and
animal cells, such as the prefoldin and cytosolic chaperonin
complexes and tubulin-folding cofactors. The dynamic stability of MTs
and AFs is influenced by MT-associated proteins and actin-binding
proteins, respectively, several of which are encoded by Arabidopsis
genes. These include the MT-severing ATPase katanin,
AF-cross-linking/bundling proteins, such as fimbrins and villins, and
AF-disassembling proteins, such as profilin and actin-depolymerizing
factor/cofilin. The Arabidopsis proteome appears to lack
homologues of proteins that, in animal cells, link the actin cytoskeleton
across the plasma membrane to the extracellular matrix, such as
integrin, talin, spectrin, α-actinin, vitronectin or vinculin. This
apparent lack of 'anchorage' proteins is consistent with the different
composition of the cell wall and with a prominence of cortical MTs
at the expense of cortical AFs in plant cells.

Plant-specific cytoskeletal arrays include interphase cortical MTs
mediating cell shape, the preprophase band marking the cortical site
of cell division, and the phragmoplast assisting in cytokinesis 76.
Although plant cells lack structural counterparts of the yeast spindle
pole body and the animal centrosome, Arabidopsis has homologues
of core components of the MT-nucleating γ-tubulin ring complex,
such as γ-tubulin, Spc97/hGCP2 and Spc98/hGCP3. Arabidopsis
has numerous motor molecules, both kinesins and dyneins with
associated dynactin complex proteins, which are presumably
involved in the dynamic organization of MTs and in transporting
cargo along MT tracks. There are also myosin motors that may be
involved in AF-supported organelle trafficking. Essential features of
the eukaryotic cytoskeleton appear to be conserved in Arabidopsis.

The Arabidopsis genome encodes homologues of proteins
involved in vesicle budding including several ARFs and
ARF-related small G-proteins, large but not small ARF GEFs (adenosine
ribosylation factor on guanine nucleotide exchange factor), adapter
proteins, and coat proteins of the COP and non-COP types.
Arabidopsis also has homologues of proteins involved in vesicle
docking and fusion, including SNAP receptors (SNAREs),
N-ethylmaleimide-sensitive factor (NSF) and Cdc48-related ATPases,
accessory proteins such as Sec1 and soluble NSF attachment protein
(SNAP), and Rab-type GTPases. The large number of Arabidopsis
SNAREs can be grouped by sequence similarity to yeast and animal
counterparts involved in specific trafficking pathways, and some
have been localized to the trans-Golgi and the pre-vacuolar
pathway 77. Arabidopsis also has a receptor for retention of proteins
in the endoplasmic reticulum, a cargo receptor for transport to the
vacuole and several phragmoplastins related to animal dynamin
GTPases. Thus, plant cells appear to use the same basic machinery
for vesicle trafficking as yeast and animal cells.

Animal cells possess many functionally diverse small G-proteins
of the Ras superfamily involved in signal transduction, AF
reorganization, vesicle fusion and other processes. Surprisingly,
Arabidopsis appears to lack genes for G-proteins of the Ras, Rho,
Rac and Cdc42 subfamilies but has many Rab-type G-proteins
involved in vesicle fusion and several Rop-type G-proteins, one of
which has a role in actin organization of the tip-growing pollen
tube 78. The significance of this divergent amplification of different
subfamilies of small G-proteins in plants and animals remains to be
determined.

Arabidopsis possesses cyclin-dependent kinases (CDKs),
including a plant-specific Cdc2b kinase expressed in a
cell-cycle-dependent manner, several cyclin subtypes, including a
D-type cyclin that mediates cytokinin-stimulated cell-cycle
progression 79, a retinoblastoma-related protein and components of
the ubiquitin-dependent proteolytic pathway of cyclin degradation. In
yeast and animal cells, chromosome condensation is mediated by
condensins, sister chromatids are held together by cohesins such as
Scc1, and metaphase-anaphase transition is triggered by separin/Esp1
endopeptidase proteolysis of Scc1 on APC-mediated degradation of its
inhibitor, securin/Pds1. Related proteins are encoded by the
Arabidopsis genome. Thus, the basic machinery of cell-cycle progression,
genome duplication and segregation appears to be conserved in
plants. By contrast, entry into M phase, M-phase progression and
cytokinesis seem to be modified in plant cells. Arabidopsis does not
appear to have homologues of Cdc25 phosphatase, which activates
Cdc2 kinase at the onset of mitosis, or of polo kinase, which
regulates M-phase progression in yeast and animals. Conversely,
plant-specific mitogen-activated protein (MAP) kinases appear to be
involved in cytokinesis.

Cytokinesis partitions the cytoplasm of the dividing cell. Yeast
and animal cells expand the membrane from the surface towards the
centre in a cleavage process supported by septins and a contractile
ring of actin and type II myosin. By contrast, plant cytokinesis starts
in the centre of the division plane and progresses laterally. A
transient membrane compartment, the cell plate, is formed de
novo by fusion of Golgi-derived vesicles trafficking along the
phragmoplast MTs 80. Consistent with the unique mode of plant
cytokinesis, Arabidopsis appears to lack genes for septins and type II
myosin. Conversely, cell-plate formation requires a
cytokinesis-specific syntaxin that has no close homologue in yeast and
animals. Although syntaxin-mediated membrane fusion occurs in animal
cytokinesis and cellularization, the vesicles are delivered to the base
of the cleavage furrow. Thus, the plant-specific mechanism of cell
division is linked to conserved eukaryotic cell-cycle machinery.

Two main conclusions are suggested by this comparative analysis. 
First, Arabidopsis and eukaryotic cells have common features related 
to intracellular activities, such as vesicle trafficking, cytoskeleton
and cell cycle. Second, evolutionarily divergent features, such as
organization of the cytoskeleton and cytokinesis, appear to relate to 
the plant cell wall. 

Development 

The regulation of development in Arabidopsis, as in animals,
involves cell-cell communication, hierarchies of transcription
factors, and the regulation of chromatin state; however, there is no
reason to suppose that the complex multicellular states of plant and
animal development have evolved by elaborating the same general
processes during the 1.6 billion years since the last common
unicellular ancestor of plants and animals 81,82. Our genome analyses
reflect the long, independent evolution of many processes
contributing to development in the two kingdoms.

Plants and animals have converged on similar processes of pattern 
formation, but have used and expanded different transcription 
factor families as key causal regulators. For example, segmentation 
in insects and differentiation along the anterior-posterior and limb 
axes in mammals both involve the spatially specific activation of a 
series of homeobox gene family members. The pattern of activation 
is causal in the later differentiation of body and limb axis regions. In 
plants the pattern of floral whorls (sepals, petals, stamens, carpels) is 
also established by the spatially specific activation of members of a 
family of transcription factors, but in this instance the family is the 
MADS box family. Plants also have homeobox genes and animals 
have MADS box genes, implying that each lineage invented
separately its mechanism of spatial pattern formation, while converging
on actions and interactions of transcription factors as the
mechanism. Other examples show even greater divergence of plant and
animal developmental control. Examples are the AP2/EREBP and
NAC families of transcription factors, which have important roles in
flower and meristem development; both families are so far found
only in plants (Supplementary Information Table 6).

A similar story can be told for cell-cell communication. Plants do
not seem to have receptor tyrosine kinases, but the Arabidopsis
genome has at least 340 genes for receptor Ser/Thr kinases,
belonging to many different families, defined by their putative
extracellular domains (Supplementary Information Table 7). Several
families have members with known functions in cell-cell communication,
such as the CLV1 receptor involved in meristem cell signalling, the
S-glycoprotein homologues involved in signalling from pollen to
stigma in self-incompatible Brassica species, and the BRI1 receptor
necessary for brassinosteroid signalling 83. Animals also have
receptor Ser/Thr kinases, such as the transforming growth factor-β
(TGF-β) receptors, but these act through SMAD proteins that are
absent from Arabidopsis. The leucine-rich repeat (LRR) family of
Arabidopsis receptor kinases shares its extracellular domain with
many animal and fungal proteins that do not have associated kinase
domains, and there are at least 122 Arabidopsis genes that code for
LRR proteins without a kinase domain. Other Arabidopsis receptor
kinase families have extracellular domains that are unfamiliar in
animals. Thus, evolution is modular, and the plant and animal
lineages have expanded different families of receptor kinases for a
similar set of developmental processes.

Several Arabidopsis genes of developmental importance appear to 
be derived from a cyanobacteria-like genome (Supplementary 
Information Table 2), with no close relationship to any animal or 
fungal protein. One salient example is the family of ethylene 
receptors; another gene family of apparent chloroplast origin is 
the phytochromes — light receptors involved in many developmen- 
tal decisions (see below). Whereas the land plant phytochromes 
show clear homology to the cyanobacterial light receptors, which
are typical prokaryotic histidine kinases, the plant phytochromes
are histidine kinase paralogues with Ser/Thr specificity 84. Similarly
to the ethylene receptors, the proteins that act downstream of plant
phytochrome signalling are not found in cyanobacteria, and thus it
appears that a bacterial light receptor entered the plant genome 
through horizontal transfer, altered its enzymatic activity, and 
became linked to a eukaryotic signal transduction pathway. This 
infusion of genes from a cyanobacterial endosymbiont shows that 
plants have a richer heritage of ancestral genes than animals, and 
unique developmental processes that derive from horizontal gene 
transfer. 

Signal transduction 

Being generally sessile organisms, plants have to respond to local 
environmental conditions by changing their physiology or redirect- 
ing their growth. Signals from the environment include light and 
pathogen attack, temperature, water, nutrients, touch and gravity. 
In addition to local cellular responses, some stimuli are commu- 
nicated across the plant body, with plant hormones and peptides 
acting as secondary messengers. Some hormones, such as auxin, are
taken up into the cell, whereas others, such as ethylene and
brassinosteroids, and the peptide CLV3, act as ligands for receptor
kinases on the plasma membrane. No matter where the signal is 
perceived by the cell, it is transduced to the nucleus, resulting in 
altered patterns of gene expression. 

Comparative genome analysis between Arabidopsis, C. elegans
and Drosophila supports the idea that plants have evolved their own
pathways of signal transduction 85. None of the components of the
widely adopted signalling pathways found in vertebrates, flies or
worms, such as Wingless/Wnt, Hedgehog, Notch/lin12, JAK/STAT,
TGF-β/SMADs, receptor tyrosine kinase/Ras or the nuclear steroid
hormone receptors, is found in Arabidopsis. By contrast,
brassinosteroids are ligands of the BRI1 Ser/Thr kinase, a member of the
largest recognizable class of transmembrane sensors encoded by
340 receptor-like kinase (RLK) genes in the Arabidopsis genome
(Supplementary Information Table 7). With a few notable
exceptions, such as CLV1, the types of ligands sensed by RLKs are


808 



NATURE | VOL 408 | 14 DECEMBER 2000 | www.nature.com



completely unknown, providing an enormous future challenge for
plant biologists. G-protein-coupled receptors (GPCRs)/
seven-transmembrane proteins are an abundant class of proteins in
mammalian genomes, instrumental in signal transduction.
INTERPRO detected 27 GPCR-related domains in Arabidopsis
(Supplementary Information Table 1), although there is no direct
experimental evidence for these. Arabidopsis contains a family of
18 seven-transmembrane proteins of the mildew resistance (MLO)
class, several of which are involved in defence responses. Notably,
only single Gα (GPA1) and Gβ (AGB1) subunits are found in
Arabidopsis, both previously known 86.

Although cyclic GMP has been proposed to be involved in signal
transduction in Arabidopsis 87, a protein containing a guanylate
cyclase domain was not identified in our analyses. Nevertheless,
cyclic nucleotide-binding domains were detected in various
proteins, indicating that cNMPs may have a role in plant signal
transduction. Thus, although cNMP-binding domains appear to
have been conserved during evolution, cNMP synthesis in
Arabidopsis may have evolved independently.

We were unable to identify a protein with significant similarity to
known Gγ subunits, but recent biochemical studies suggest that a
protein with this functional capacity is likely to be present in plant
cells (H. Ma, personal communication). Therefore, there is
potential for the formation of only a single heterotrimeric G-protein
complex; however, its functional interaction with any of the
potential GPCR-related proteins remains to be determined.

Modules of cellular signal pathways from bacteria and animals
have been combined and new cascades have been innovated in
plants. A pertinent example is the response to the gaseous plant
hormone ethylene 88. Ethylene is perceived and its signal transmitted
by a family of receptors related to bacterial-type two-component
histidine kinases (HKs). In bacteria, yeast and plants, these proteins
sense many extracellular signals and function in a His-to-Asp
phosphorelay network 89. In turn, these proteins physically interact
with the genetically downstream protein CTR1, a
Raf/MAPKKK-related kinase, revealing the juxtaposition of
bacterial-type two-component receptors and animal-type MAP kinase
cascades. Unlike animals, however, Arabidopsis does not seem to have a
Ras protein to activate the MAP kinase cascade. MAP kinases are found
in abundance in Arabidopsis: we identified ~20, a higher number than
in any other eukaryote. As potentially counteracting components,
we found ~70 putative PP2C protein phosphatases. Although this
group is largely uncharacterized functionally, several members are
related to ABI1/ABI2, key negative regulators in the signalling
pathway for the plant hormone abscisic acid. Additional
components of the His-to-Asp phosphorelay system were also found in
Arabidopsis, including authentic response regulators (ARRs),
pseudoresponse regulators (PRRs) and phosphotransfer intermediate
protein (HPt) 90. We found 11 HKs in the proteome (3 new), 16 RRs
(2 new) and 8 PRRs (2 new). The biological roles of most ARRs,
PRRs and HPts are largely unknown but several have been found
to have diverse functions in plants, including transcriptional
activation in response to the plant hormone cytokinin 91, and as
components of the circadian clock 92.

Plants seem to have evolved unique signalling pathways by combining a
conserved MAP kinase cascade module with new receptor types. In many
cases, however, the ligands are unknown. Conversely, some known
signalling molecules, such as auxin, are still in search of a
receptor. Auxin signalling may represent yet another plant-specific
mode of signalling, with protein degradation through the
ubiquitin-proteasome pathway preceding altered gene expression. With
many Arabidopsis genes encoding components of the ubiquitin-proteasome
pathway, elimination of negative regulators may be a more widespread
phenomenon in plant signalling.

Recognizing and responding to pathogens

Plants are constantly exposed to pests, parasites and pathogens and
have evolved many defences. In mammals, polymorphism for parasite
recognition encoded in the MHC genes contributes to resistance. In
plants, disease resistance (R) genes that confer parasite recognition
are also extremely polymorphic. This polymorphism has been proposed to
restrict parasites, and its absence may explain the breakdown of
resistance in crop monocultures⁹³. In contrast to MHC genes, plant
resistance genes are found at several loci, and the complete genome
sequence enables analysis of their complement and structure. Parasite
recognition by resistance genes triggers defence mechanisms through
various signalling molecules, such as protein kinases and adapter
proteins, ion fluxes, reactive oxygen intermediates and nitric oxide.
These halt pathogen colonization through transcriptional activation of
defence genes and a form of programmed cell death called the
hypersensitive response⁹⁴. The Arabidopsis genome contains diverse
resistance genes distributed at many loci, along with components of
signalling pathways, and many other genes whose role in disease
resistance has been inferred from mutant phenotypes.

Most resistance genes encode intracellular proteins with a
nucleotide-binding (NB) site typical of small G proteins, and
carboxy-terminal LRRs⁹⁵. Their amino termini either carry a TIR
domain, or a putative coiled coil (CC). There are 85 TIR-NB-LRR
resistance genes at 64 loci, and 36 CC-NB-LRR resistance genes at 30
loci. Some NB-LRR resistance genes express neither obvious TIR nor CC
domains at their N termini. This potential class is present seven
times, at six loci. There are 15 truncated TIR-NB genes that lack an
LRR at 10 loci, often adjacent to full TIR-NB-LRR genes. There are
also six CC-NB genes, at five loci. These truncated products may
function in resistance. Intriguingly, two TIR-NB-LRR genes carry a
WRKY domain, found in transcription factors that are implicated in
plant defence, and one of these also encodes a protein kinase domain.

Resistance gene evolution may involve duplication and divergence of
linked gene families⁹⁶; however, most (46) resistance genes are
singletons; 50 are in pairs, 21 are in 7 clusters of 3 family members,
with single clusters of 4, 5, 7, 8 and 9 members, respectively. Of the
non-singletons, ~60% of pairs are in direct repeats, and ~40% are in
inverted repeats. Resistance genes are unevenly distributed between
chromosomes, with 49 on chromosome 1; 2 on chromosome 2; 16 on
chromosome 3; 28 on chromosome 4; and 55 on chromosome 5.
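
The counts in the paragraph above can be cross-checked with a little arithmetic. The following is only a sketch: the variable names are ours, and it assumes that the scanned "chromosome I" means chromosome 1. It tallies the resistance genes once by cluster size and once by chromosome, so the two totals can be compared.

```python
# Resistance-gene counts quoted in the text, grouped by cluster size.
singletons = 46                     # genes that occur alone
genes_in_pairs = 50                 # i.e. 25 pairs
genes_in_triple_clusters = 7 * 3    # 7 clusters of 3 family members
larger_clusters = [4, 5, 7, 8, 9]   # one cluster of each size

total_by_cluster = (
    singletons
    + genes_in_pairs
    + genes_in_triple_clusters
    + sum(larger_clusters)
)

# The same genes, counted per chromosome (chromosome number -> gene count).
per_chromosome = {1: 49, 2: 2, 3: 16, 4: 28, 5: 55}
total_by_chromosome = sum(per_chromosome.values())

print("total by cluster size:", total_by_cluster)
print("total by chromosome: ", total_by_chromosome)
```

If the two totals agree, the cluster and chromosome breakdowns describe the same set of genes; a mismatch would point to a misread digit in the scan.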

In other plant species, resistance genes encode both transmembrane
receptors for secreted pathogen products and protein kinases, and some
other classes are also found. The Cf genes in tomato encode
extracellular LRRs with a transmembrane domain and short cytoplasmic
domain. Mutation in an Arabidopsis homologue, CLAVATA2, results in
enlarged meristems, but to date no resistance function has been
assigned to the 30 Arabidopsis CLV2 homologues. CLAVATA1, a
transmembrane LRR kinase, is also required for meristem function.
Xa21, a rice LRR-kinase, confers Xanthomonas resistance, and the
Arabidopsis FLS2 LRR kinase confers recognition of flagellin. It has
been proposed that CLV1 and CLV2 function as a heterodimer; perhaps
this is also true for Xa21, FLS2 and Cf proteins. There are 174 LRR
transmembrane kinases in Arabidopsis, with only FLS2 assigned a role
in resistance. A unique resistance gene, beet Hs1pro-1, which confers
nematode resistance, has two Arabidopsis homologues.

The tomato Pto Ser/Thr kinase acts as a resistance protein in
conjunction with an NB-LRR protein, so similar kinases might do the
same for Arabidopsis NB-LRR proteins. There are 860 Ser/Thr kinases in
the Arabidopsis sequence. Fifteen of these share 50% identity over the
Pto-aligned region. The Toll pathway in Drosophila and mammals
regulates innate immune responses through LRR/TIR domain receptors
that recognize bacterial lipopolysaccharides⁹⁶. Pto is highly
homologous to Drosophila PELLE and mammalian IRAK protein kinases that
mediate the TIR pathway.



NATURE | VOL 408 | 14 DECEMBER 2000 | www.nature.com




Additional genes have been defined that are required for resistance by
our analysis of the genome sequence. The ndr1 mutation defines a gene
required by the CC-NB-LRR genes RPS2 and RPM1. NDR1 is 1 of 28
Arabidopsis genes that are similar both to each other and to the
tobacco HIN1 gene that is transcriptionally induced early during the
hypersensitive response. EDS1 is a gene required for TIR-NB-LRR
function, and like PAD4, encodes a protein with a putative lipase
motif. PAD4 and a third gene comprise the EDS1/PAD4 family. The
NPR1/NIM1/SAI1 gene is required for systemic acquired resistance, and
we found five additional NPR1 homologues. Recessive mutations at both
the barley Mlo and Arabidopsis LSD1 loci confer broad-spectrum
resistance and derepress a cell-death program. There are at least 18
Mlo family members that resemble heterotrimeric GPCRs in Arabidopsis,
and only two LSD1 homologues.

One of the earliest responses to pathogen recognition is the
production of reactive oxygen intermediates. This involves a
specialized respiratory burst oxidase protein that transfers an
electron across the plasma membrane to make superoxide. Arabidopsis
encodes eight apparently functional gp91 homologues, called Atrboh
genes. Unlike gp91, they all carry an ~300 amino-acid N-terminal
extension carrying an EF-hand Ca²⁺-binding domain. In mammals,
activation of the respiratory oxidative burst complex in the
neutrophil, which includes gp91, requires the action of Rac proteins.
As no Rac or Ras proteins are found in Arabidopsis, members of the
large rop family of G proteins may carry this out. Similarly, we did
not detect any Arabidopsis homologues of other mammalian respiratory
burst oxidase components (p22, p47, p67, p40).

There are no clear homologues of many mammalian defence and cell-death
control genes. Although nitric oxide production is involved in plant
defence, there is no obvious homologue of nitric oxide synthase. Also
absent are apparent homologues of the REL domain transcription factors
involved in innate immunity in both Drosophila and mammals. We found
no similarity to proteins involved in regulating apoptosis in animal
cells, such as classical caspases, bcl2/ced9 and baculovirus p35.
There are, however, 36 cysteine proteases. There are also eight
homologues of a newly defined metacaspase family⁹⁷, two of which,
along with LSD1, have a clear GATA-type zinc-finger.

Photomorphogenesis and photosynthesis 

Because nearly all plants are sessile and most depend on
photosynthesis, they have evolved unique ways of responding to light.
Light serves as an energy source, as well as a trigger and modulator
of complex developmental pathways, including those regulated by the
circadian clock. Light is especially important during seedling
emergence, where it stimulates chlorophyll production, leaf
development, cotyledon expansion, chloroplast biogenesis and the
coordinated induction of many nuclear- and chloroplast-encoded genes,
while at the same time inhibiting stem growth. The goal of this
process, called photomorphogenesis, is the establishment of a body
plan that allows the plant to be an efficient photosynthetic machine
under varying light conditions⁹⁸. The signal transduction cascade
leading to light-induced responses begins with the activation of
photoreceptors. Next, the light signal is transduced via positively
and negatively acting nuclear and cytoplasmic proteins, causing
activation or derepression of nuclear and chloroplast-encoded
photosynthetic genes and enabling the plant to establish optimal
photoautotrophic growth. Although genetic and biochemical studies have
defined many of the components in this process, the genome sequence
provides an opportunity to identify comprehensively Arabidopsis genes
involved in photomorphogenesis and the establishment of
photoautotrophic growth. We identified at least 100 candidate genes
involved in light perception and signalling, and 139 nuclear-encoded
genes that potentially function in photosynthesis.



The roles of only 35 of the 100 candidate photomorphogenic genes have
been described (Supplementary Information Table 3). All of the light
photoreceptors had been discovered previously, including five
red/far-red absorbing phytochromes (PHYA-E), two blue/ultraviolet-A
absorbing cryptochromes (CRY1 and CRY2), one blue-absorbing
phototropin (NPH1) and one NPH1-like (or NPL1). In contrast, we
uncovered many new proteins similar to the photomorphogenesis
regulators COP/DET/FUS, PKS1, PIF3, NDPK2, SPA1, FAR1, GIGANTEA,
FIN219, HY5, CCA1, ATHB-2, ZEITLUPE, FKF1, LKP1, NPH3 and RPT2.

Both the phytochromes and NPH1 contain chromophores for light sensing
coupled to kinase domains for signal transmission. Phytochromes have
an N-terminal chromophore-binding domain, two PAS domains, and a
C-terminal Ser/Thr kinase domain⁹⁹, whereas NPH1 has two LOV domains
(members of the PAS domain superfamily) for flavin mononucleotide
binding and a C-terminal Ser/Thr kinase domain¹⁰⁰. PAS domains
potentially sense changes in light, redox potential and oxygen energy
levels, as well as mediating protein-protein interactions⁹⁹,¹⁰⁰. We
searched for uncharacterized proteins with the combination of a kinase
domain and either a phytochrome chromophore-binding site or PAS
domains. Although we found no new phytochrome-like genes, we did
identify four predicted proteins that contain PAS and kinase domains
(Supplementary Information Fig. 6). These proteins share 80%
amino-acid identity, but, unlike NPH1 and NPL1, have only one PAS
domain. The combination of potential signal sensing and transmitting
domains makes it tempting to speculate that these proteins may be
receptors for light or other signals.

Our screen included searches for components of photosynthetic reaction
centres and light-harvesting complexes, enzymes involved in CO₂
fixation and enzymes in pigment biosynthesis. We identified 11 core
proteins of photosystem I, including the eukaryotic-specific
components PsaG and PsaH¹⁰¹, and 8 photosystem II proteins, including
a single member (psbW) of the photosystem II core. We also found 26
proteins similar to the chlorophyll-a/b binding proteins (8 Lhca and
18 Lhcb). Of the seven subunits of the cytochrome b₆f complex (PetA-D,
PetG, PetL, PetM), only one (PetC) was found in the nuclear genome,
whereas the remainder are probably encoded in the chloroplast.
Similarly, of the nine subunits of the chloroplast ATP synthase
complex, three are encoded in the nucleus, including the II-, γ- and
δ-subunits; the remaining subunits (I, III, IV, α, β, ε) are encoded
in the chloroplast¹⁰². Ten genes were related to the soluble
components of the electron transfer chain, including two
plastocyanins, five ferredoxins and three ferredoxin/NADP
oxidoreductases. Forty genes are predicted to have a role in CO₂
fixation, including all of the enzymes in the Calvin-Benson cycle. For
pigment biosynthesis, 16 genes in chlorophyll biosynthesis and 31
genes in carotenoid biosynthesis were found (Supplementary Information
Table 8). Our analyses have identified several potential components of
the light perception pathway, and have revealed the complex
distribution of components of the photosynthetic apparatus between
nuclear and plastid genomes.

Metabolism 

Arabidopsis is an autotrophic organism that needs only minerals,
light, water and air to grow. Consequently, a large proportion of the
genome encodes enzymes that support metabolic processes, such as
photosynthesis, respiration, intermediary metabolism, mineral
acquisition, and the synthesis of lipids, fatty acids, amino acids,
nucleotides and cofactors¹⁰³. With respect to these processes,
Arabidopsis appears to contain a complement of genes similar to those
in the photoautotrophic cyanobacterium Synechocystis, but, whereas
Synechocystis generally has a single gene encoding an enzyme,
Arabidopsis frequently has many. For example, Arabidopsis has at least
seven genes for the glycolytic enzyme pyruvate kinase, with an
additional five for pyruvate kinase-like proteins. Whatever the reason
for this high level of redundancy, it varies from gene to gene in the
same pathway; the 11 enzymes of glycolysis are encoded by up to 51
genes that are present in as few as one or as many as eight copies.
Similarly, of the 59 genes encoding proteins involved in glycerolipid
metabolism, 39 are represented by more than one gene¹⁰⁴. Genome
duplication and expansion of gene families by tandem duplication have
contributed to this diversity.

This high degree of apparent structural redundancy does not
necessarily imply functional redundancy. For instance, although there
are seven genes for serine hydroxymethyltransferase, a mutation in the
gene for the mitochondrial form completely blocks the photorespiratory
pathway¹⁰⁵. Although there are 12 genes for cellulose synthase,
mutations in at least 2 of the 12 confer distinct phenotypes because
of tissue-specific gene expression¹⁰⁶.

The metabolome of Arabidopsis differs from that of cyanobacteria, or
of any other organism sequenced to date, by the presence of many genes
encoding enzymes for pathways that are unique to vascular plants. In
particular, although relatively little is known about the enzymology
of cell-wall metabolism, more than 420 genes could be assigned
probable roles in pathways responsible for the synthesis and
modification of cell-wall polymers. Twelve genes encode cellulose
synthase, and 29 other genes encode 6 families of structurally related
enzymes thought to synthesize other major polysaccharides¹⁰⁶. Roughly
52 genes encode polygalacturonases, 20 encode pectate lyases and 79
encode pectin esterases, indicating a massive investment in modifying
pectin. Similarly, the presence of 39 β-1,3-glucanases, 20
endoxyloglucan transglycosylases, 50 cellulases and other hydrolases,
and 23 expansins reflects the importance of wall remodelling during
growth of plant cells. Excluding ascorbate and glutathione
peroxidases, there are 69 genes with significant similarity to known
peroxidases and 15 laccases (diphenol oxidases). Their presence in
such abundance indicates the importance of oxidative processes in the
synthesis of lignin, suberin and other cell-wall polymers.

The high degree of apparent redundancy in the genes for cell-wall
metabolism might reflect differences in substrate specificity by some
of the enzymes. It is already known that cell types have different
wall compositions, which may require that the relevant enzymes be
subject to cell-type-specific transcriptional regulation. Of the 40 or
so cell types that plants make, almost all can be identified by unique
features of their cell wall¹⁰⁷. A large number of genes involved in
wall metabolism have yet to be defined. Although more than 60 genes
for glycosyltransferases can be found in the genome sequence, most of
these are probably involved in protein glycosylation or metabolite
catabolism and do not seem to be adequate to account for the
polysaccharide complexity of the wall. For instance, at least 21
enzymes are required just to produce the linkages of the pectic
polysaccharide RGII, and none of these enzymes has been identified at
present. Thus, if these and related enzymes involved in the synthesis
of other cell-wall polymers are also represented by multiple genes, a
substantial number of the genes of currently unknown function may be
involved in cell-wall metabolism.

Higher plants collectively synthesize more than 100,000 secondary
metabolites. Because flowering plants are thought to have similar
numbers of genes, it is apparent that a great deal of enzyme creation
took place during the evolution of higher plants. An important factor
in the rapid evolution of metabolic complexity is the large family of
cytochrome P450s that are evident in Arabidopsis (Supplementary
Information Table 1). These enzymes represent a superfamily of
haem-containing proteins, most of which catalyse NADPH- and
O₂-dependent hydroxylation reactions. Plant P450s participate in
myriad biochemical pathways including those devoted to the synthesis
of plant products, such as phenylpropanoids, alkaloids, terpenoids,
lipids, cyanogenic glycosides and glucosinolates, and plant growth
regulators, such as gibberellins, jasmonic acid and brassinosteroids.
Whereas Arabidopsis has ~286 P450 genes, Drosophila has 94, C. elegans
has 73 and yeast has only 3. This low number in yeast indicates that
there are few reactions of basic metabolism that are catalysed by
P450s. It seems likely that many animal P450s are involved in
detoxification of compounds from food plant sources. The role of
endogenous enzymes is poorly understood; only a few dozen P450 enzymes
from plants have been characterized to any extent. The discrepancy
between the number of known P450-catalysed reactions and the number of
genes suggests that Arabidopsis produces a relatively large number of
metabolites that have yet to be identified.

In addition to the large number of cytochrome P450s, Arabidopsis has
many other genes that suggest the existence of pathways or processes
that are not currently known. For instance, the presence of 19 genes
with similarity to anthranilate N-hydroxycinnamoyl/benzoyl transferase
is currently inexplicable. This enzyme is involved in the synthesis of
dianthramide phytoalexins in Caryophyllaceae and Gramineae. No
phytoalexins of this class have been described in Arabidopsis as yet.
Similarly, the presence of 12 genes with sequence similarity to the
berberine bridge enzyme ((S)-reticuline:oxygen oxidoreductase
(methylene-bridge-forming); EC 1.5.3.9), and 13 genes with similarity
to tropinone reductase, suggests that Arabidopsis may have the ability
to produce alkaloids. In other plants, the berberine bridge enzyme
transforms reticuline into scoulerine, a biosynthetic precursor to a
multitude of species-specific protopine, protoberberine and
benzophenanthridine alkaloids. The discovery of these and many other
intriguing genes in the Arabidopsis genome has created a wealth of new
opportunities to understand the metabolic and structural diversity of
higher plants.

Concluding remarks 

The twentieth century began with the rediscovery of Mendel's rules 
of inheritance in pea¹⁰⁸, and it ends with the elucidation of the
complete genetic complement of a model plant, Arabidopsis. The 
analysis of the completed sequence of a flowering plant reported 
here provides insights into the genetic basis of the similarities and 
differences of diverse multicellular organisms. It also creates the 
potential for direct and efficient access to a much deeper under- 
standing of plant development and environmental responses, and 
permits the structure and dynamics of plant genomes to be assessed 
and understood. 

Arabidopsis, C. elegans and Drosophila have a similar range of
11,000-15,000 different types of proteins, suggesting this is the
minimal complexity required by extremely diverse multicellular
eukaryotes to execute development and respond to their environment. We
account for the larger number of gene copies in Arabidopsis compared
with these other sequenced eukaryotes with two possible explanations.
First, independent amplification of individual genes has generated
tandem and dispersed gene families to a greater extent in Arabidopsis,
and unequal crossing over may be the predominant mechanism involved.
Second, ancestral duplication of the entire genome and subsequent
rearrangements have resulted in segmental duplications. The pattern of
these duplications suggests an ancient polyploidy event, and mutant
analysis indicates that at least some of the many duplicate genes are
functionally redundant. Their occurrence in a functionally diploid
genetic model came as a surprise, and is reminiscent of the situation
in maize, an ancient segmental allotetraploid. The remarkable degree
of genome plasticity revealed in the large-scale duplications may be
needed to provide new functions, as alternative promoters and
alternative splicing appear to be less widely used in plants than they
are in animals. Apart from duplicated segments, the overall chromosome
structure of Arabidopsis closely resembles that of Drosophila;
transposons and other repetitive sequences are concentrated in the
heterochromatic regions surrounding the centromere, whereas the
euchromatic arms are largely devoid of repetitive sequences.
Conversely, most protein-coding genes reside in the euchromatin,
although a number of expressed genes have been identified in
centromeric regions. Finally, Arabidopsis is the first methylated
eukaryotic genome to be sequenced, and will be invaluable in the study
of epigenetic inheritance and gene regulation.

Unlike most animals, plants generally do not move, they can perpetuate
indefinitely, they reproduce through an extended haploid phase, and
they synthesize all their metabolites. Our comparison of Arabidopsis,
bacterial, fungal and animal genomes starts to define the genetic
basis for these differences between plants and other life forms. Basic
intracellular processes, such as translation or vesicle trafficking,
appear to be conserved across kingdoms, reflecting a common eukaryotic
heritage. More elaborate intercellular processes, including physiology
and development, use different sets of components. For example,
membrane channels, transporters and signalling components are very
different in plants and animals, and the large number of transcription
factors unique to plants contrasts with the conservation of many
chromatin proteins across the three eukaryotic kingdoms. Unexpected
differences between seemingly similar processes include the absence of
intracellular regulators of cell division (Cdc25) and apoptosis
(Bcl-2). On the other hand, DNA repair appears more highly conserved
between plants and mammals than within the animal kingdom, perhaps
reflecting common factors such as DNA methylation. Our analysis also
shows that many genes of the endosymbiotic ancestor of the plastid
have been transferred to the nucleus, and the products of this rich
prokaryotic heritage contribute to diverse functions such as
photoautotrophic growth and signalling.

The sequence reported here changes the fundamental nature of plant
genetic analysis. Forward genetics is greatly simplified as mutations
are more conveniently isolated molecularly, but at the same time
extensive gene duplications mean that functional redundancy must be
taken into account. At a biochemical level, the specificity conferred
by nucleotide sequence, and the completeness of the survey allow
complex mixtures of RNA and protein to be resolved into their
individual components using micro-arrays and mass spectrometry. This
specificity can also be used in the parallel analysis of genome-wide
polymorphisms and quantitative traits in natural populations¹⁰⁹.
Looking ahead, the challenge of determining the function of the large
set of predicted genes, many of which are plant-specific, is now a
clear priority, and multinational programs have been initiated to
accomplish this goal using site-selected mutagenesis among the
necessary tools¹¹⁰. Finally, productive paths of crop improvement,
based on enhanced knowledge of Arabidopsis gene function, will help
meet the challenge of sustaining our food supply in the coming years.

Note added in proof: at the time of publication 17 centromeric BACs
and 5 sequence gaps in chromosome arms are being sequenced.

Methods 

The three centres used similar annotation approaches involving in silico gene-finding
methods, comparison to EST and protein databases, and manual reconciliation of that
data. Gene finding involved three steps: (1) analysis of BAC sequences using a
computational gene finder; (2) alignment of the sequence to the protein and EST
databases; (3) assignment of functions to each of the genes. Genscan¹¹¹,
GeneMark.HMM¹¹², Xgrail¹¹³, Genefinder (P. Green, unpublished software) and
GlimmerA¹¹⁴ were used to analyse BAC sequences. All of these systems were specially
trained for Arabidopsis genes. Splice sites were predicted using NetGene2¹¹⁵,
Splice Predictor¹¹⁶ and GeneSplicer (M. Pertea and S. Salzberg, unpublished
software). For the second step, BACs were aligned to ESTs and to the Arabidopsis
gene index¹¹⁷ using programs such as DDS/GAP2¹¹⁸ or BLASTN¹¹⁹. Segmental
duplications were analysed and displayed using a modified version of DIALIGN2
(ref. 120).

Arabidopsis Map-Based Cloning in the Post-Genome Era



Georg Jander*, Susan R. Norris, Steven D. Rounsley, David F. Bush, Irena M. Levin¹, and Robert L. Last

Cereon Genomics LLC, 45 Sidney Street, Cambridge, Massachusetts 02139 



Map-based cloning is an iterative approach that identifies the underlying genetic cause of a mutant phenotype. The major
strength of this approach is the ability to tap into a nearly unlimited resource of natural and induced genetic variation
without prior assumptions or knowledge of specific genes. One begins with an interesting mutant and allows plant biology
to reveal what gene or genes are involved. Three major advances in the past 2 years have made map-based cloning in
Arabidopsis fairly routine: sequencing of the Arabidopsis genome, the availability of more than 50,000 markers in the Cereon
Arabidopsis Polymorphism Collection, and improvements in the methods used for detecting DNA polymorphisms. Here,
we describe the Cereon Collection and show how it can be used in a generic approach to mutation mapping in Arabidopsis.
We present the map-based cloning of the VTC2 gene as a specific example of this approach.



Map-based cloning, also called positional cloning, is the process of
identifying the genetic basis of a mutant phenotype by looking for
linkage to markers whose physical location in the genome is known. The
amount of effort required for map-based cloning of genes in
Arabidopsis has dropped dramatically in recent years (Fig. 1). Only a
few years ago, it was necessary to build a physical map, develop
markers, and iteratively zero in on the gene by "chromosome walking."
This was followed by cloning, complementation by transformation, and
de novo determination of the sequence of the entire region of interest
to high quality without a previously determined wild-type DNA sequence
as a guide (Arondel et al., 1992; Giraudat et al., 1992; Leung et al.,
1994; Meyer et al., 1994; Mindrinos et al., 1994).
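
As a rough illustration of what "looking for linkage to markers" involves, the sketch below estimates the recombination fraction between a recessive mutation and a single marker. It is not taken from the paper. It assumes an F2 population from a cross between the mutant and a polymorphic wild-type parent, in which only plants showing the mutant phenotype are genotyped. The function name, the genotype codes and the example counts are hypothetical.

```python
def recombination_fraction(marker_genotypes):
    """Estimate the recombination fraction between a mutation and a marker.

    marker_genotypes holds one code per F2 plant with the mutant phenotype:
      "M" = homozygous for the mutant parent's marker allele
      "H" = heterozygous at the marker
      "W" = homozygous for the other parent's marker allele
    Each such plant carries two mutant chromosomes, so every marker allele
    from the other parent marks one recombinant chromosome.
    """
    if not marker_genotypes:
        raise ValueError("at least one genotyped plant is required")
    recombinant_per_plant = {"M": 0, "H": 1, "W": 2}
    recombinant = sum(recombinant_per_plant[g] for g in marker_genotypes)
    chromosomes_scored = 2 * len(marker_genotypes)
    return recombinant / chromosomes_scored


# Hypothetical example: 50 mutant F2 plants, 5 heterozygous at the marker.
example = ["M"] * 45 + ["H"] * 5
print(recombination_fraction(example))
```

A marker with a small recombination fraction lies close to the mutation, and one that segregates independently of the phenotype is unlinked. Repeating the calculation with markers of known physical position narrows the interval that must contain the gene.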

Many of the steps of chromosome walking have been eliminated or have
been made much easier by three nearly simultaneous breakthroughs
during the past 2 years: sequencing of the entire Columbia (Col-0)
Arabidopsis genome (The Arabidopsis Genome Initiative, 2000), the
availability of tens of thousands of randomly distributed genetic
markers to registered users of the Cereon Arabidopsis Polymorphism
Collection (http://www.arabidopsis.org/cereon/), and advances in the
methods used to detect DNA polymorphisms. One can now proceed from a
mutant with a desirable phenotype to an identified mutation in a gene
with less than one person-year of effort (Fig. 1). The minimal
start-to-finish time of a mapping project has also been shortened
significantly, making it possible to find a gene using an iterative
approach taking approximately 1 year (Fig. 2).

In the process of map-based cloning, one starts with a mutant and
eventually identifies the gene responsible for the altered phenotype,
allowing the plant to tell you what genes are important in the
physiological process of interest. This is in contrast to reverse
genetic approaches, which tend to rely on some sort of prior knowledge
that the gene that is being mutated will be interesting. When using
reverse genetic approaches, such as TILLING for point mutations
(McCallum et al., 2000) or searching for T-DNA insertion mutations
(Sussman et al., 2000), one starts with a gene of interest, finds a
mutation in that gene, and then looks for a phenotype.

The big advantage to map-based cloning is that it is 
a process without prior assumptions. Essentially, one 
is looking at all of the genes in the genome at the 
same time to find the ones that affect the phenotype 
of interest. It is a process of discovery that makes it 
possible to find mutations anywhere in the genome, 
including intergenic regions and the 40% of Arabi- 
dopsis genes that do not resemble any gene with 
known or inferred function (The Arabidopsis Ge- 
nome Initiative, 2000). 

Insertional mutagenesis using T-DNA or transposons has become
increasingly popular as a tool for gene discovery. Pools of lines
representing more than 200,000 insertional mutations are available
from Arabidopsis stock centers (http://www.Arabidopsis.org/abrc;
http://nasc.nott.ac.uk). Large-scale projects are under way for
disrupting most genes in Arabidopsis by insertional mutagenesis
(Sussman et al., 2000). Mutant screens performed using these
populations are undoubtedly worthwhile and can lead to rapid
identification of the gene of interest if it actually has a T-DNA or
transposon insertion. However, there are also several good reasons to
screen for mutants in chemically mutagenized populations and to
isolate the affected genes by map-based cloning.

Insertional mutations tend to result in complete 
knockouts of the gene, making it difficult to associate 
a phenotype other than death with essential genes. In 
contrast, chemical mutagenesis, e.g. with ethyl meth- 
ane sulfonate, can produce promoter mutations or 

Arabidopsis Map-Based Cloning in the Post-Genome Era 

[Figure 1: flowchart comparing the key steps in the map-based cloning process in 1996 and 2002] 

Total effort: 3 to 5 person-years (1996); less than 1 person-year (2002) 
First-pass mapping (20 cM resolution): mixture of DNA and visible markers (1996); standard molecular marker set is available (2002) 
Physical map: build physical map, YACs or cosmids (1996); physical map exists (2002) 
Fine-scale mapping (40 kbp resolution): develop markers from YACs or cosmids (1996); choose markers from Cereon database (2002) 
Consider candidate genes: no DNA sequence for candidate genes available (1996); identify candidate genes from Col-0 sequence (2002) 
Final identification of mutation: clone and complement, then de novo sequencing (1996); design PCR primers from Col-0 or Ler sequence (2002) 

Figure 1. Comparison of effort involved in map-based cloning. Key 
steps that have become easier between 1995 and 2002 are 
presented. 



mis-sense mutations in the coding region, resulting 
in a hypomorphic knock-down rather than an amor- 
phic knockout of a protein function. Many interesting 
but essential genes have been found through such 
hypomorphic mutations. For instance, "leaky" muta- 
tions in VTC1 (CYT1) can result in ozone sensitivity 
and reduced vitamin C levels in Arabidopsis (Con- 
klin et al., 1999), but knockout mutations cause em- 
bryo lethality (Lukowitz et al., 2001). Key regulatory 
steps in biochemical pathways are often found 
through dominant point mutations that prevent feed- 
back inhibition of an enzyme, e.g. anthranilate syn- 
thase (Kreps et al., 1996; Li and Last, 1996) or Asp 
kinase (Heremans and Jacobs, 1997). Such dominant 
mutations would not be found by insertional 
mutagenesis. 

Chemical mutagenesis, in addition to generating a 
greater diversity of mutations than insertional mu- 
tagenesis, also results in many more mutations in 
each individual plant. Plants mutagenized with 
T-DNA typically have only one to three insertions 
per line. Even in a best-case scenario (insertion of 
three T-DNAs per line in a completely random man- 
ner, which is not likely), more than 100,000 plants are 
needed for a 95% likelihood of having a mutation in 
a given gene of average size. Screening this many 
plants can be prohibitive if the mutant screen being 
performed is laborious or slow. In contrast, ethyl 
methane sulfonate mutagenesis typically introduces 
dozens of mutations in each plant line, and it is 
generally possible to find a mutation in any given 
gene by screening fewer than 5,000 plants (Feldman 
et al., 1994). 
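
The T-DNA estimate above can be checked against a simple random-insertion model. The sketch below is illustrative only: the genome size (125 Mbp) and the target size for a gene of average size (1.2 kb) are assumptions chosen for the example and are not given in the text, and the function name is hypothetical.

```python
import math

def plants_needed(gene_bp, genome_bp, insertions_per_plant, confidence=0.95):
    """Lines needed so that at least one insertion hits a given gene.

    Assumes every insertion lands uniformly at random in the genome,
    which the text itself describes as a best-case simplification.
    """
    p_hit = gene_bp / genome_bp  # chance that one insertion hits the gene
    # Solve 1 - (1 - p_hit)**n >= confidence for the number of insertions n.
    insertions = math.log(1.0 - confidence) / math.log(1.0 - p_hit)
    return math.ceil(insertions / insertions_per_plant)

# Assumed values: 125-Mbp genome, 1.2-kb target, three T-DNAs per line.
n_lines = plants_needed(gene_bp=1_200, genome_bp=125_000_000,
                        insertions_per_plant=3)
print(n_lines)
```

Under these assumed inputs the result is a little over 100,000 lines, in line with the figure quoted in the text; a larger assumed gene size lowers the number proportionally.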

The techniques of map-based gene identification 
are also essential for the identification of the genetic 
basis of phenotypic variation among Arabidopsis 
ecotypes (natural isolates). The genomes of Arabidopsis 

Plant Physiol. Vol. 129, 2002 

ecotypes differ from one another at many 
thousands of locations and represent a level of ge- 
netic variation that is not achievable in the laboratory 
(Alonso-Blanco and Koornneef, 2000). Hundreds of 
ecotypes collected from around the world are avail- 
able to researchers through Arabidopsis stock centers 
(http://www.Arabidopsis.org/abrc; http://nasc.nott.ac.uk). 
Phenotypic variation for almost any trait of 
interest can be found in progeny of crosses made 
between these ecotypes. In many cases this variation 
is due to the effects of several genes and is quantita- 
tive in nature. Statistical methods developed in the 
1990s (Haley and Knott, 1992; Jansen, 1993; Zeng, 
1994) and the availability of an almost unlimited set 
of genetic markers (see below) make it feasible to 
map and clone such quantitative trait loci (QTL). We 
will not describe QTL mapping here, but other recent 
reviews have covered this subject (Kearsey and Far- 
quhar, 1998; Alonso-Blanco and Koornneef, 2000; 
Yano, 2001). 

In this paper, we present a large set of DNA mark- 
ers identified at Cereon Genomics, we describe how 
these markers can be applied to a generic map-based 
cloning project, and we introduce the VTC2 gene as 
an example of a specific mapping project. 

THE CEREON ARABIDOPSIS 
POLYMORPHISM COLLECTION 

Positional cloning of genes in Arabidopsis is 
greatly facilitated by the recent sequencing of Col-0 
and Landsberg erecta (Ler). These two ecotypes were 
sequenced because they are among the most com- 
monly used ecotypes in Arabidopsis research. 
George Redei, one of the founders of modern Arabi- 
dopsis genetics, began working with Col and Ler in 
the 1950s (Redei, 1992). Since then, they have been 
the subjects of literally thousands of papers that have 
been published on the genetics, molecular biology, 
and biochemistry of Arabidopsis. Col-0 and Ler are 
also the parents of a widely used collection of recom- 
binant inbred lines (Lister and Dean, 1993). Hun- 
dreds of markers have been analyzed in these lines, 
and the genetic map produced from this work has 
become the standard against which other Arabidop- 
sis genetic maps are aligned. 

The Col-0 ecotype was the subject of a large inter- 
national sequencing project, which has produced a 
nearly complete sequence using a clone by clone 
approach (The Arabidopsis Genome Initiative, 2000). 
This high-quality sequence (less than one error in 
10,000 bp) is a permanent resource for all future 
Arabidopsis sequencing efforts. Partial genomic se- 
quence data generated from other ecotypes can be 
positioned on the framework of Col-0 genome se- 
quence. Sequencing of individual genes from mu- 
tants or from other ecotypes has become routine; it is 
simply a matter of designing PCR primers based on 
the Col-0 sequence, amplifying the desired gene, and 
sequencing the product. 

The Ler ecotype was the subject of a very different 
genome sequencing effort, low coverage shotgun se- 
quencing at Cereon Genomics. This project generated 
approximately 700,000 500-bp sequence traces. Of 
these, more than 200,000 were chloroplast, mitochon- 
drial, or ribosomal DNA and were not used for the 
assembly. This left 498,037 traces totaling 263 Mbp of 
good quality raw sequence, representing approxi- 
mately 2-fold coverage of the Arabidopsis genome. 
Assembly of the sequences produced 50,262 contigs 
(average size, 1.5 kb) and 31,044 single-read se- 
quences. The size of the assembled dataset totaled 
92.1 Mbp, suggesting that approximately 70% of the 
genome is covered at the nucleotide level. To assess 
the coverage at the gene level, more than 2,000 cDNA 
sequences from GenBank were extracted and 
searched against the Ler shotgun dataset using the 
BLASTn algorithm (Altschul et al., 1990). A total of 
96.5% of the cDNAs were at least partially detected 
using a 95% identity cutoff, indicating that at least 
some sequence from over 95% of all genes is present 
in the data assembled from the low coverage shotgun 
approach. 

For Arabidopsis researchers who are interested in 
map-based cloning, the value of two genome se- 
quences greatly exceeds that of only one such se- 
quence. Whereas the availability of the genome 
sequence of a single ecotype mainly facilitates DNA 
sequencing in the final stages of a mapping project 
(Fig. 1), data from two genomes make it possible to 
develop a database of DNA polymorphisms that can 
be used as genetic markers. A high-density map of 
DNA markers greatly facilitates fine-scale genetic 
mapping. To generate such a map, we compared 
stretches of Ler shotgun sequence with Col-0 
genomic sequence determined from cloned bacterial 
artificial chromosomes (BACs; we will refer to all 
large DNA clones sequenced by the Col-0 genome 
project collectively as BACs). Differences between 
the ecotypes were classified into two types: single 
nucleotide polymorphism (SNP) changes, which alter 
a single nucleotide present at a specific location in the 
genome (Fig. 3), and insertion-deletion (InDel) differ- 
ences, where one ecotype has an insertion of a num- 
ber of nucleotides relative to the other (Fig. 3). 

To detect SNPs and InDels, one must be able to 
accurately predict true polymorphisms against a 
background of sequencing errors. This is of particular 
concern for the Ler data, which are unedited shotgun 
sequence, in contrast to the high quality "finished" 
Col-0 sequence. To increase the likelihood of detect- 
ing real ecotypic differences, fairly stringent criteria 
were applied to a single base difference before calling 
it a bioinformatically predicted SNP. The aligned 

Col-0  AACATTCCTCAAGTTTGGTTA 
       |||||||||| |||||||||| 
Ler    AACATTCCTCTAGTTTGGTTA 
SNP 442795 

[InDel 448516 alignment: eight-nucleotide insertion (ATTTTCTA) in Col-0 relative to Ler] 

Figure 3. Examples of SNP and InDel polymorphisms. Two markers 
from the Cereon Arabidopsis Polymorphism Collection are shown. 
Marker 442795 has a single-nucleotide change from A to T, whereas 
marker 448516 has an eight-nucleotide insertion in Col-0 versus Ler. 

region of Ler and Col-0 sequence had to be longer 
than 200 bp and to include more than 75% of the 
length of the Ler sequence. In addition, the polymor- 
phic base must be unambiguous in Ler, covered by at 
least two reads, and be greater than 25 bp from any 
single coverage region. The quality of the local se- 
quence must be high: The SNP-containing base must 
have a phrap consensus quality score (Green, 1996; 
version 0.980812, downloaded 1999) of at least 40, 
and the surrounding 25 nucleotides must have con- 
sensus scores of at least 30. Re-sequencing of the Ler 
allele of a representative sample of SNPs predicted in 
this way showed that the success rate was close to 
100%. Single-basepair InDels were found using the 
same methods as those used for SNP prediction. Less 
stringent criteria were applied for the detection of 
larger InDels. A gapped alignment between Ler and 
Col-0 was required to be greater than 90% identical 
over the matched region, with an insertion of at least 
2 bp in either Col-0 or Ler. Unlike with SNP poly- 
morphisms, we did not confirm a representative 
sample of predicted InDels by resequencing the Ler 
allele. Given the less stringent selection criteria, the 
error rate for predicted InDel polymorphisms is 
likely to be higher than the error rate for predicted 
SNP polymorphisms. 
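
The SNP selection criteria listed above can be written as a single predicate. This is a minimal sketch, not the pipeline used at Cereon: the record layout and all field names are hypothetical, and the "surrounding 25 nucleotides" are assumed to be supplied by the caller as a list of consensus quality scores.

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass
class SnpCandidate:
    aligned_length: int              # Ler/Col-0 aligned region, in bp
    ler_length: int                  # full length of the Ler sequence, in bp
    ler_base_ambiguous: bool         # True if the Ler base call is ambiguous
    read_depth: int                  # number of Ler reads covering the base
    dist_to_single_coverage: int     # bp to the nearest single-coverage region
    base_quality: int                # phrap consensus quality of the SNP base
    flank_qualities: Sequence[int]   # qualities of the surrounding nucleotides

def passes_snp_filter(c: SnpCandidate) -> bool:
    """Apply the thresholds stated in the text to one candidate SNP."""
    return (
        c.aligned_length > 200
        and c.aligned_length > 0.75 * c.ler_length
        and not c.ler_base_ambiguous
        and c.read_depth >= 2
        and c.dist_to_single_coverage > 25
        and c.base_quality >= 40
        and all(q >= 30 for q in c.flank_qualities)
    )

# Hypothetical candidates: the second differs only in having a single read.
good = SnpCandidate(450, 500, False, 3, 40, 45, [35] * 25)
single_read = SnpCandidate(450, 500, False, 1, 40, 45, [35] * 25)
print(passes_snp_filter(good), passes_snp_filter(single_read))
```

Keeping every threshold in one place makes it easy to see why the SNP set is conservative: a candidate is rejected as soon as any one criterion fails.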

At the time of writing, sequence for 1,501 Col-0 
BACs representing 123 Mbp of Col-0 genome se- 
quence had been compared against the assembled Ler 
shotgun sequence. This resulted in the identification 
of 37,344 SNPs, 18,579 small InDels (less than or 
equal to 100 bp), 747 large InDels (larger than 100 
bp), or a total of 56,670 polymorphisms. On average, 
there is one bioinformatically predicted SNP every 
3.3 kb and one predicted InDel every 6.6 kb. The 





[Figure 4: histogram; x axis, SNPs or InDels per BAC] 

Figure 4. Frequency of SNPs and InDels by BAC. A total of 56,668 
SNP and InDel polymorphisms between Col-0 and Ler were identi- 
fied. These polymorphisms were assigned to 1,501 sequenced Col-0 
BAC clones (The Arabidopsis Genome Initiative, 2000). Data are 
presented as bins of 5, i.e. 1 to 5 polymorphisms/BAC, 6 to 10 
polymorphisms/BAC, etc. Nineteen BACs have no predicted InDel or 
SNP polymorphisms. 

SNPs and InDels are distributed throughout the ge- 
nome, with most BACs having several polymor- 
phisms that could be used for genetic mapping (Fig. 
4). Because of the stringent selection criteria and the 
partial Ler sequence, these numbers represent an un- 
derestimate of the true frequency of SNP and InDel 
differences that exist. For instance, a screen of 500 
kb of Arabidopsis sequence by denaturing HPLC 
(DHPLC) found polymorphisms at a frequency of 
close to one per kilobasepair (Cho et al., 1999). The 
Cereon Arabidopsis Polymorphism Collection is 
made available to registered users at non-profit and 
educational institutions for non-commercial research. 
Access is obtained by one-time registration through 
The Arabidopsis Information Resource Web site 
(http://www.arabidopsis.org/cereon/). At the time 
of writing, 890 researchers from 40 countries had 
registered to use this database. 

Figure 5. Insertions in Col-0. A total of 10,578 insertions in Col-0 
relative to Ler were identified. Insertion size data are presented as 
bins of 0.3 log10 (no. of basepairs), i.e. log10 (no. of basepairs) < 0.3, 
0.3 < log10 (no. of basepairs) < 0.6, etc. The median (4 bp) and mean 
(175 bp) insertion sizes are indicated. 

The five chromosomes of Arabidopsis have ap- 
proximately equal densities of SNP polymorphisms. 
Not surprisingly, SNP frequency varies between ex- 
ons and introns, with one SNP every 3.1 and 2.2 kb, 
respectively. Transitions (A/T to G/C) account for 
51.8% of the SNPs, and transversions occur with 
frequencies of 17.4% (A/T to T/A), 23.0% (T/A to 
G/C), and 7.9% (C/G to G/C). There is no Col-0 or 
Ler bias in the directionality of the transitions or 
transversions. 
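
The distinction between transitions and transversions used above follows from base chemistry: a transition exchanges a purine for a purine or a pyrimidine for a pyrimidine, and every other substitution is a transversion. A minimal sketch follows; the function name is illustrative and not part of any tool mentioned in the text.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
VALID_BASES = PURINES | PYRIMIDINES

def classify_snp(col_base: str, ler_base: str) -> str:
    """Label a single-base difference as a transition or a transversion."""
    a, b = col_base.upper(), ler_base.upper()
    if a not in VALID_BASES or b not in VALID_BASES:
        raise ValueError("bases must be one of A, C, G, T")
    if a == b:
        raise ValueError("identical bases are not a polymorphism")
    same_class = ({a, b} <= PURINES) or ({a, b} <= PYRIMIDINES)
    return "transition" if same_class else "transversion"

# Marker 442795 in Figure 3 is an A-to-T change.
print(classify_snp("A", "T"))
```

Because the classification is symmetric in its two arguments, it does not matter which ecotype is taken as the reference.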

InDel polymorphisms between Col-0 and Ler range 
from 1 bp to greater than 38 kb. Due to the average 
1.5-kb contig size of the Ler shotgun sequence, large 
insertions can only be detected in the Col-0 back- 
ground and not in the Ler background. Insertions in 
Col-0 relative to Ler have an average size of 175 bp 
and a median size of 4 bp (Fig. 5). Approximately 
10% of the InDels were associated with polymor- 
phisms in the length of simple sequence repeats that 
were identified with the Sputnik program (Abajian, 
1994, downloaded 1999), but most were found in 
non-repetitive sequences. Most InDels (93%) are 
smaller than 100 bp, making them suitable for PCR- 
based marker detection methods (see below). 

The Cereon Col-0/Ler SNP and InDel sequences 
should be very informative for discovering polymor- 
phisms between other ecotype pairs. If one assumes a 
random genetic reassortment of polymorphisms 
among Arabidopsis ecotypes, then 50% of the Col-0/ 
Ler polymorphisms should be useful for genetic map- 
ping in any other pair of ecotypes. Work done with 
amplified fragment length polymorphism (AFLP) 
markers, which generally are due to underlying 
SNPs, indicates that there is such a random assort- 
ment of polymorphisms. Approximately 50% of Col- 
0/Ler AFLP polymorphisms can also be used for 
segregation analysis in Col-0/C24, Col-0/Was- 
silewskija, or Col-0/Cape Verde Islands crosses (Pe- 
ters et al., 2001). Analysis of 79 AFLP markers in 142 
ecotypes shows a high degree of recombination in the 
evolution of these ecotypes, such that it is not possible 
to draw an "ecotype phylogeny" (Sharbel et al., 2000). 
Thus, the Cereon Arabidopsis Polymorphism Collec- 
tion will be useful for mapping QTLs or mutations in 
most and perhaps all other pairs of Arabidopsis 
ecotypes. It is a relatively minor disadvantage that 
one-half of all attempted markers will fail and the 
average marker density is reduced by 50%, i.e. one 
SNP every 6.6 kb instead of one SNP every 3.3 kb. 

Overall, the density of both SNP and InDel markers 
is high enough that it is theoretically possible to map 
most mutations within a few thousand basepairs us- 
ing either type of marker in any combination of 
ecotypes. The availability of genetic markers is no 
longer the limiting factor for the fine-scale genetic 
mapping needed for map-based cloning in Arabidop- 
sis. Instead, this process is limited by our ability to 
generate recombination events at a high enough den- 
sity and to rapidly and inexpensively genotype 
plants using these markers. 



METHODS FOR DETECTION OF 
DNA POLYMORPHISMS 

A critical aspect of map-based cloning is the ability 
to accurately detect DNA markers at an appropriate 
cost and throughput. In the past few years, a number 
of new technologies for high-throughput detection of 
DNA polymorphisms have been developed. Most of 
these advances were driven by the field of human 
genetics, but all of the methods can be applied 
equally well to plant systems. Because they tend to 
require a relatively large initial investment, these fast 
and highly automatable methods are best suited to 
research settings where large numbers of genotypes 
need to be determined in a short period of time and 
with minimal human intervention. 

Because SNPs are more common than InDels in 
biology and are more amenable to automation strat- 
egies, most high-throughput genotyping approaches 
are designed for SNP rather than InDel detection. 
Oligonucleotide arrays (Gene Chips) contain thou- 
sands of oligonucleotides annealed to a glass slide. 
Such arrays allow the detection of SNP polymor- 
phisms by differential hybridization in a highly par- 
allel and automated manner (Lipshutz et al., 1999). 
The Taq-Man PCR assay is designed to detect SNPs 
in a high-throughput manner through the release of 
fluorescent reporter dye from a quencher on the same 
oligonucleotide by 5' nuclease activity (Livak, 1999). 
By using more than one reporter dye, it is possible to 
detect different alleles of a SNP in a single reaction. 
The relatively high price of oligonucleotides tagged 
with reporter and quencher dyes makes this method 
cost-effective only if a large number of reactions need 
to be run with each SNP marker. In pyrosequencing, 
an enzymatic cascade and luminometric detection 
system is used to measure the pyrophosphate that is 
released as a result of nucleotide incorporation (Ah- 
madian et al., 2000; Alderborn et al., 2000). Because 
20 or more nucleotides are determined by this 
method, it is possible to detect several closely linked 
SNPs at once. The pyrosequencing method can be 
automated but has the disadvantage that it does 
not work well on stretches of repeated nucleotides. 
DHPLC allows the detection of SNPs through differ- 
ent retention times of heteroduplex and homoduplex 
DNA in reversed-phase HPLC under partially dena- 
turing conditions (Spiegelman et al., 2000). DHPLC 
allows detection of SNP polymorphisms in PCR- 
amplified DNA up to about 1,000 bp in size. Al- 
though not inherently high-throughput, DHPLC 
lends itself nicely to bulked segregant analysis. The 
method of fluorescence resonance energy transfer 
combines PCR and oligonucleotide ligation to detect 
SNPs (Chen et al., 1998). Dye-labeled oligonucleotide 
probes are used in this assay, and allele-specific liga- 
tion is detected by fluorescence resonance energy 
transfer, which only occurs when two dye-labeled 
oligos are joined by ligation. Matrix-assisted laser 
desorption/ionization time-of-flight mass spectrom- 
etry can be used to rapidly detect SNPs in short DNA 
pieces by differences in molecular mass (Wada and 
Yamamoto, 1997). 

A disadvantage of most high-throughput methods 
for detecting DNA polymorphisms is the high initial 
equipment cost, which results in a high per-assay 
cost for a lab that does not need to perform large 
numbers of genotyping reactions on a routine basis. 
In contrast, both InDels and SNPs can be detected 
using gel-based methods, which have a relatively 
low start-up cost and moderate throughput. InDels of 
small to moderate size can be detected by PCR am- 
plification and gel electrophoretic separation (Bell 
and Ecker, 1994). Pairs of PCR primers are designed 
to amplify a segment of DNA spanning the InDel, 
and size differences in the amplified products are 
detected using either agarose or acrylamide gels. 
Agarose gels are easier and less expensive to use, but 
size differences of less than 5 bp are difficult to detect 
reliably. Acrylamide gels, on the other hand, give 
single-basepair resolution and allow the detection of 
even very small InDels. In either case, InDels are 
scored as codominant markers with one band seen on 
the gel for either homozygous class and two bands 
seen for heterozygous individuals. To reduce the 
number of PCR reactions needed for a mapping 
project, it is possible to pool DNA samples for bulked 
segregant analysis (Michelmore et al., 1991; Lukowitz 
et al., 2000), or multiple primer pairs can be added to 
one reaction tube to amplify several markers at once 
(Ponce et al., 1999). 
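
The codominant scoring rule for InDel markers described above (one band for either homozygous class, two bands for a heterozygote) can be expressed as a small function. This is a sketch under stated assumptions: band sizes are taken as already estimated in basepairs, the size tolerance and the example amplicon sizes are arbitrary illustrative values, and the function name is hypothetical.

```python
def score_indel(band_sizes, col_size, ler_size, tolerance=2):
    """Score one plant at an InDel marker from its gel band sizes (bp).

    Returns "Col-0" or "Ler" for the two homozygous classes, "het" when
    both bands are present, and None when no expected band is seen.
    """
    if abs(col_size - ler_size) <= 2 * tolerance:
        raise ValueError("allele sizes too close for the chosen tolerance")
    has_col = any(abs(b - col_size) <= tolerance for b in band_sizes)
    has_ler = any(abs(b - ler_size) <= tolerance for b in band_sizes)
    if has_col and has_ler:
        return "het"
    if has_col:
        return "Col-0"
    if has_ler:
        return "Ler"
    return None

# Hypothetical marker: 158-bp Col-0 product, 150-bp Ler product.
print(score_indel([150, 158], col_size=158, ler_size=150))
```

The guard on the allele size difference mirrors the point made in the text: when two products differ by only a few basepairs, the resolution of the gel determines whether the marker can be scored at all.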

Several gel electrophoresis-based strategies for de- 
tecting SNP markers have been devised. Many SNPs 
alter sites cleaved by restriction enzymes and can be 
used as cleaved-amplified polymorphic sequence 
(CAPS; Konieczny and Ausubel, 1993) markers. 
CAPS markers are amplified by PCR, the amplified 
DNA is cleaved with the appropriate restriction en- 
zyme, and the cleavage products are examined on 
agarose gels. Just as with InDels, such markers are 
codominant, allowing the differentiation of heterozy- 
gotes and either homozygote class. If there is no 
suitable restriction site at a SNP, it is possible to create 
a site during PCR amplification with suitably de- 
signed primers (dCAPS [Michaels and Amasino, 
1998; Neff et al., 1998]). Disadvantages of using CAPS 
and dCAPS for genotyping include the extra time 
and cost involved in the restriction enzyme digestion 
and the possibility of a false result attributable to 
incomplete digestion by the restriction enzyme. 
It is also possible to detect SNPs using allele- 
specific PCR primers, where the 3' end of a primer 
has a perfect match with one allele and a mismatch 
with the other allele (Ugozzoli and Wallace, 1991). In 
theory, such primers can be used to preferentially 
amplify one allele of a SNP, but in practice a single- 
basepair change is often not enough to allow reliable 
differentiation between the two alleles of an SNP 
(Kwok et al., 1990; Cha et al., 1992). A modification of 
the allele-specific amplification procedure (single nu- 
cleotide amplified polymorphism [SNAP]) has re- 
cently been described (Drenkard et al., 2000). In this 
method, additional mismatches are introduced in the 
amplifying primers to maximize the difference in 
amplification efficiencies of the two alleles of the 
SNP. Primer basepair changes that allow differential 
amplification of SNP sites can be predicted using the 
SNAPER program. Both the SNAPER program and a 
collection of primers that have been used success- 
fully to amplify Arabidopsis SNAP markers can be 
found at http://patho.mgh.harvard.edu/ausubelweb. 
As with CAPS, SNAP markers are codominant 
and can be detected on agarose gels. However, it is 
necessary to run two PCR reactions, one for each 
allele of the SNP, to get complete SNAP genotyping 
data. 
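
Whether a SNP can serve as a CAPS marker comes down to whether it creates or destroys a restriction site in one allele. The sketch below checks this for a single recognition sequence; it searches only the given strand, which is sufficient for palindromic sites, and the function name is illustrative. The example uses the two allele sequences of marker 442795 from Figure 3 and the site CTAG (recognized by BfaI), chosen here for illustration rather than taken from the text.

```python
def caps_informative(col_seq: str, ler_seq: str, site: str) -> bool:
    """True if the two alleles contain different numbers of `site`.

    Only the given strand is searched, so this is adequate for
    palindromic recognition sequences such as CTAG or GAATTC.
    """
    site = site.upper()
    return col_seq.upper().count(site) != ler_seq.upper().count(site)

COL_442795 = "AACATTCCTCAAGTTTGGTTA"
LER_442795 = "AACATTCCTCTAGTTTGGTTA"

# The A-to-T change creates a CTAG site in the Ler allele only.
print(caps_informative(COL_442795, LER_442795, "CTAG"))
```

A practical marker design would also confirm that the enzyme cuts nowhere else in the amplified fragment in a way that obscures the size difference, which this sketch does not attempt.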

The detection of SNPs and InDels is an essential 
part of the map-based cloning process. Because 
marker discovery is no longer a problem in Arabi- 
dopsis, the selection of an efficient genotyping plat- 
form plays a critical role in the mapping timeline that 
we describe in the next section. We have mentioned 
several commonly used genotyping methods, and 
the choice of which method to apply will depend on 
the resources of an individual laboratory and the 
number of genotyping reactions that will need to be 
performed. 



MAP-BASED CLONING PROCESS 

Given a sequenced genome and a dense collection 
of genetic markers, map-based cloning becomes a 
relatively straightforward process. Figure 2 illus- 
trates a time-efficient approach to map-based cloning 
in Arabidopsis, a variant of the "chromosome land- 
ing" method proposed by Tanksley et al. (1995). 
Starting with a mutation in the Col-0 or Ler back- 
ground, it is possible to proceed from having a mu- 
tant plant to identifying the affected gene in approx- 
imately 1 year. The overall length of this cloning 
process is dictated largely by the fact that it incorpo- 
rates five cycles of plant growth (we assume 2 
months/cycle). 

As a first step in the mapping process, the mutant 
is out-crossed to the opposite ecotype (Col-0 or Ler). 
In most cases, it is not necessary to "clean up" the 
genetic background of the mutant by back-crossing 
and it does not matter whether the mutant is used as 
the male or the female parent in the out-cross. F1 
seeds are planted and, as the plants are growing, it is 
possible to perform phenotype and genotype analy- 
sis. Presence or absence of the phenotype in the F1 
generation will suggest whether the mutation of in- 
terest is likely to be dominant or recessive. We rec- 
ommend genotyping the F1 plants with a few mark- 
ers to make sure that they are truly heterozygous and 
that there was no mistake made during the cross. 
Similarly, it is worthwhile to genotype the original 
mutant to make sure that it is in the presumed 
ecotype background. Contamination with other 
ecotypes is a surprisingly frequent cause of "mu- 
tants" that arise in screens. 

F2 seeds are collected from self-pollination of the F1 
plants, and a population of approximately 600 indi- 
viduals is planted for first-pass mapping (Fig. 2). As 
they are growing, the phenotype of the F2 plants is 
determined, unless the trait can only be scored in the 
progeny (F3) seed. It should be possible to identify 
approximately 150 plants in this population as ho- 
mozygous: homozygous mutant in the case of a re- 
cessive mutation or homozygous wild type in the 
case of a dominant mutation. DNA for genotype 
analysis is prepared from the leaves or other tissue of 
these 150 plants. Initially, the 150 plants are geno- 
typed with 25 markers, spaced roughly 20 
centiMorgan (cM) apart on the five chromosomes. 
Genetic linkage to one or more of the 25 markers is 
determined and a three-point cross is used to define 
a 20-cM interval that contains the gene of interest. 
Once a 20-cM interval has been found, additional 
markers are used to narrow down the region of in- 
terest to approximately 4 cM. Given a population of 
150 plants, it should be possible to determine this 
4-cM interval with a high degree of certainty. The 
two markers closest to the mutation on either side 
will be used as flanking markers in further work. 
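
Linkage in the first-pass population can be quantified by counting recombinant chromosomes. The sketch below assumes a recessive mutation and a set of homozygous mutant F2 plants, so that every marker allele inherited from the other parent marks a recombination event between marker and mutation; the function name and the example counts are hypothetical.

```python
def recombination_fraction(n_mutant_parent_hom, n_het, n_other_parent_hom):
    """Estimate the recombination fraction between a marker and a mutation.

    Inputs are counts of homozygous mutant F2 plants by marker genotype:
    homozygous for the mutant parent's allele, heterozygous, and
    homozygous for the other parent's allele. Each plant contributes
    two chromosomes.
    """
    plants = n_mutant_parent_hom + n_het + n_other_parent_hom
    if plants == 0:
        raise ValueError("no plants were genotyped")
    recombinant_chromosomes = n_het + 2 * n_other_parent_hom
    return recombinant_chromosomes / (2 * plants)

# Hypothetical counts for 150 homozygous mutant plants at one marker.
r = recombination_fraction(136, 13, 1)
print(f"{100 * r:.1f}% recombinant chromosomes")
```

For tightly linked markers the percentage of recombinant chromosomes approximates the map distance in cM; for an unlinked marker the expected value is 50%.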

Next, it is necessary to plant a larger F2 population 
for fine-resolution mapping (Fig. 2). The ultimate 
goal of fine mapping is to narrow down the region 
containing the gene of interest to 40 kb or less (ap- 
proximately 0.16 cM genetic distance in Arabidop- 
sis). There would ideally be several recombination 
events in this interval to define the position of the 
mutation that is being mapped. Unfortunately, the 
number of F2 plants needed to have a 95% chance of 
recombination events in a given genetic interval in- 
creases rapidly as the size of the interval decreases 
(Fig. 6). We recommend having a fine mapping pop- 
ulation of 3,000 to 4,000 plants (including the original 
600 lines grown for first-pass mapping) to give a high 
probability of mapping the gene of interest to less 
than 40 kb. In areas of the genome with reduced 
meiotic recombination, e.g. near the centromeres, 
larger F2 populations will be necessary to map a 
mutation to an equivalent physical interval on the 
chromosome. Many Arabidopsis mapping projects 
have been successful with fewer than 3,000 to 4,000 
F2 plants (Lukowitz et al., 2000), but when planting 
fewer plants one runs the risk of extending the map- 
ping timeline by having to plant an additional F2 
population later on. 

Figure 6. Number of plants needed to find recombinants. The curves 
show the number of F2 plants needed to have a 95% chance of 
finding at least one plant (1), at least three plants (3), or at least five 
plants (5) with recombination events in a given physical interval of 
DNA. The calculations assume an average 250 kb/cM for Arabidop- 
sis (Lukowitz et al., 2000). The possibility of multiple recombination 
events in one individual plant has a negligible effect and is not 
included in the calculations. 
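
The kind of calculation summarized in Figure 6 can be approximated with a binomial model. The sketch below is an assumption-laden reconstruction, not the authors' own calculation: it takes the stated 250 kb/cM, treats each F2 plant as two independent meioses (so the chance that a plant is recombinant in the interval is about twice the recombination fraction), and ignores multiple events per plant as the caption does.

```python
import math

KB_PER_CM = 250.0  # average for Arabidopsis, as stated in the caption

def prob_at_least(k, n_plants, p):
    """P(X >= k) for X ~ Binomial(n_plants, p)."""
    below = sum(
        math.comb(n_plants, i) * p**i * (1.0 - p) ** (n_plants - i)
        for i in range(k)
    )
    return 1.0 - below

def f2_plants_needed(interval_kb, min_recombinants=1, confidence=0.95):
    """Smallest F2 population giving the requested chance of finding
    at least `min_recombinants` recombinant plants in the interval."""
    r = interval_kb / KB_PER_CM / 100.0  # recombination fraction per meiosis
    p = 2.0 * r                          # two meioses per F2 plant (assumed)
    n = min_recombinants
    while prob_at_least(min_recombinants, n, p) < confidence:
        n += 1
    return n

for k in (1, 3, 5):
    print(k, f2_plants_needed(40, min_recombinants=k))
```

Under these assumptions a 40-kb interval needs on the order of 900 to 1,000 plants for at least one recombinant and close to 3,000 for at least five, which is consistent with the recommended population of 3,000 to 4,000 plants.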

At this point plants that are recombinant in the 
4-cM interval determined by first-pass mapping are 
sought for use in fine mapping. DNA is isolated from 
the mapping population of 3,000 to 4,000 plants and 
the genotype of the two flanking markers is deter- 
mined. This should identify 200 to 300 plants that 
have genetic recombination events in the region of 
interest (Fig. 2). The allelic state of the mutation being 
mapped (homozygous mutant, homozygous wild 
type, or heterozygous) in these recombinant plants is 
determined by looking at the phenotype in a repre- 
sentative sample of progeny in the F3 generation. 
Additional markers in the 4-cM interval are used to 
look for increasingly tight linkage to the mutation. In 
most cases, it should be possible to define a pair of 
markers flanking the mutation that are less than 40 
kb apart. 

Once an interval of less than 40 kb containing the 
mutation of interest has been determined, this entire 
region is sequenced to find the mutation. In theory, it 
is possible to map a mutation to the single-gene level 
using the Cereon Arabidopsis Polymorphism Collec- 
tion, but the number of F2 plants needed to find 
recombinants in such a small interval would be very 
large (Fig. 6). It is faster and less expensive to se- 
quence a larger interval. Because the sequence of the 
Col-0 genome is known, one efficient way to se- 
quence the mutant region is to design PCR primers to 
amplify overlapping segments of about 500 bp span- 
ning the entire 40 kb. These segments are then se- 
quenced and assembled, the sequence is compared 
with that of a wild-type plant (Col-0 or Ler), and the 
mutation is identified. In the case of a mutation in 
the Ler background, it is necessary to also sequence 
the Ler wild type for comparison at every location 
where a difference to the wild-type Col-0 is found. In 
the case of a mutation in Col-0, a published sequence 
is available. However, it is necessary to confirm that 
any nucleotide that diverges from the published 
Col-0 sequence was induced by the mutagenesis 
treatment and is not present in the wild-type progen- 
itor strain. This is because strain to strain differences 
exist in "Col-0 wild type," and even at the high 
quality standard of the Col-0 sequence, sequencing 
errors are expected and found. 
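
Laying out the overlapping 500-bp segments across a 40-kb interval is a simple tiling problem. The sketch below returns amplicon coordinates only and does not design primers; the 50-bp overlap is an assumed value, since the text specifies only that the segments overlap, and the function name is hypothetical.

```python
def tile_amplicons(start, end, amplicon_len=500, overlap=50):
    """Return 1-based inclusive (start, end) coordinates of overlapping
    amplicons that together cover the interval [start, end]."""
    if not 0 <= overlap < amplicon_len:
        raise ValueError("overlap must be smaller than the amplicon length")
    step = amplicon_len - overlap
    tiles = []
    pos = start
    while True:
        tile_end = min(pos + amplicon_len - 1, end)
        tiles.append((pos, tile_end))
        if tile_end >= end:
            break
        pos += step
    return tiles

# A 40-kb interval in arbitrary local coordinates.
tiles = tile_amplicons(1, 40_000)
print(len(tiles), tiles[0], tiles[-1])
```

With these assumed settings the interval needs just under 90 amplicons, and the final amplicon is shortened so that it ends exactly at the interval boundary.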

APPLICATION OF CEREON MARKERS TO 
CLONING VTC2 

The identification of the VTC2 gene is a specific 
example of a map-based cloning project using the 
Cereon Arabidopsis Polymorphism Collection. The 
vtc2-1 mutation was isolated in a screen for ozone-sensitive
mutants of Arabidopsis (Conklin et al.,
1996). Further work showed that this mutant was
deficient in ascorbic acid (vitamin C), and an additional
three alleles (vtc2-2, vtc2-3, and vtc2-4) were
isolated based on this phenotype. A first-pass map
position for the vtc2-1 mutation between CAPS markers
WU95 (74 cM) and PRHA (78 cM) on chromosome 4
was reported (Conklin et al., 2000).
The CAPS markers WU95 and PRHA are relatively
difficult to score. Instead, we used the nearby InDel 
markers (449235 and 450577 from the Cereon Arabidopsis
Polymorphism Collection) as flanking markers
for fine mapping (Fig. 7A). These markers are
approximately 980 kb apart on chromosome 4. DNA 
segments spanning these markers were amplified by 
PCR, and the amplified products were detected by
PAGE. A population of 3,700 Col-0 vtc2-2 × Ler
plants was analyzed with markers 449235 and 
450577. A total of 52 recombinants were identified 
and confirmed by repeating the genotyping with the 
same markers in the F3 generation. The number of
recombinants is considerably less than one would 
expect given the genetic separation previously re- 
ported for the CAPS markers WU95 and PRHA (4 cM 
apart, expected approximately 280 recombinants). 
We do not have a good explanation for this observa- 
tion, but it does illustrate the utility of generating a 
mapping population that is larger than the theoreti- 
cal minimum needed. 
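
The comparison between expected and observed recombinants can be made
concrete with a short calculation. The sketch below is a first-order
approximation, not the authors' own formula (which is not given): it treats
1 cM as a 1% chance of recombination per meiotic product and assumes each
recombinant plant carries a single recombinant chromosome. The function
names are hypothetical. Under these assumptions the simple product lands
slightly above the figure of approximately 280 quoted in the text.

```python
def expected_recombinant_chromosomes(n_f2_plants, genetic_distance_cM):
    """First-order expectation: each F2 plant carries two meiotic products,
    and 1 cM is treated as a 1% chance of recombination per product."""
    return 2 * n_f2_plants * genetic_distance_cM / 100.0


def observed_distance_cM(n_recombinants, n_f2_plants):
    """Genetic distance implied by a recombinant count, assuming each
    recombinant plant carries one recombinant chromosome."""
    return 100.0 * n_recombinants / (2 * n_f2_plants)


n_plants = 3700                                              # plants screened (from the text)
expected = expected_recombinant_chromosomes(n_plants, 4.0)   # WU95 to PRHA, 4 cM
observed_cM = observed_distance_cM(52, n_plants)             # 52 recombinants found
kb_per_cM = 980 / observed_cM                                # InDel markers about 980 kb apart

print(f"expected recombinant chromosomes: {expected:.0f}")
print(f"observed distance: {observed_cM:.2f} cM ({kb_per_cM:.0f} kb per cM)")
```

Note that the expectation uses the CAPS markers WU95 and PRHA, whereas the
observed count comes from the nearby InDel markers 449235 and 450577, so
the two intervals are close but not identical.
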
Additional markers between 449235 and 450577
were chosen from the Cereon Arabidopsis Polymor- 
phism Collection (Fig. 7B) for fine mapping. All 52 
recombinants were genotyped with these 21 markers 
to narrow down the positions of the recombination 
events. Pieces of DNA containing the marker of interest
were amplified by PCR, and the polymorphisms
were detected by PAGE (for InDels) or DNA
sequencing (for SNPs). Vitamin C levels of individual 
F3 progeny (at least 20 per line) were measured to
determine whether the 52 F2 recombinants were homozygous
mutant, homozygous wild type, or heterozygous
at the VTC2 locus. This information was
combined with the marker data to identify markers 
424439 and 424446, which are contained in BAC F10M23
(GI:4756963), as the closest markers flanking the
mutation. 
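
The way marker genotypes and progeny-test results are combined to find the
closest flanking markers can be sketched as follows. This is a minimal
illustration, not the authors' software: the function name, the genotype
codes, and the toy marker names m1 to m5 are hypothetical, and the sketch
assumes each recombinant carries a single crossover within the marker set.

```python
def closest_flanking_markers(markers, plants):
    """Narrow a locus to the interval between its two closest flanking markers.

    markers: marker names ordered along the chromosome.
    plants:  list of (calls, locus_genotype) pairs. calls holds one genotype
             per marker ('C' = Col-0 homozygous, 'H' = heterozygous,
             'L' = Ler homozygous); locus_genotype is the genotype at the
             mutant locus inferred from progeny testing, in the same codes.
    Plants without exactly one genotype switch across the markers are skipped.
    """
    left, right = 0, len(markers) - 1
    for calls, locus in plants:
        switches = [i for i in range(1, len(calls)) if calls[i] != calls[i - 1]]
        if len(switches) != 1:
            continue
        i = switches[0]                # crossover lies between markers i-1 and i
        if locus == calls[i - 1]:      # locus is to the left of the crossover
            right = min(right, i)
        elif locus == calls[i]:        # locus is to the right of the crossover
            left = max(left, i - 1)
    return markers[left], markers[right]


# Toy data (hypothetical): the locus lies between m3 and m4.
markers = ["m1", "m2", "m3", "m4", "m5"]
plants = [
    (["C", "C", "H", "H", "H"], "H"),
    (["C", "C", "C", "H", "H"], "C"),
    (["H", "H", "H", "L", "L"], "L"),
    (["L", "L", "L", "L", "H"], "L"),
]
flanks = closest_flanking_markers(markers, plants)
print(flanks)
```

In a cross such as Col-0 vtc2-2 × Ler, a plant that is homozygous mutant at
the locus would be coded 'C', because the mutation is in the Col-0
background.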

Markers 424439 and 424446 are approximately 20 
kb apart. In the Col-0 genomic sequence, there are
nine predicted genes in this region (Fig. 7C), but none 
are annotated as enzymes of the proposed Wheeler- 
Smirnoff Pathway for vitamin C biosynthesis in 
plants (Wheeler et al., 1998). We designed primer
pairs to amplify overlapping segments of DNA spanning
the 20-kb region from the vtc2-2 mutant. Sequencing
of these fragments and comparison with
the wild-type Col-0 sequence identified a mis-sense
change in the putative gene F10M23.190 (GI:7452423;
Fig. 7D), resulting in a Gly to Asp change in the
predicted exon 5 (new GenBank ID AF508793). This
gene was also sequenced from the three other vtc2 
mutants. A mis-sense mutation was identified in 
vtc2-3 (Fig. 7D), resulting in a Ser to Phe change in
the predicted exon 6. Both vtc2-1 and vtc2-4 had the
same mutation, which changed the 3' splice site of
the predicted intron 5 from AG to AA (Fig. 7D).
These last two mutations are almost certainly inde- 
pendently generated, because one was isolated in 
wild-type Col-0 and the other was from a strain of
Col-0 carrying a PAT1-GUS transgene (Rose and
Last, 1997). Together, these four mutations show that
putative gene F10M23.190 is VTC2. As additional
confirmation, all four mutant alleles of VTC2 were 
complemented using genomic clones of F10M23.190
isolated from Col-0 by PCR (L Levin and S. Norris, 
unpublished data). 
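
As an illustration of the amino acid changes reported above, the sketch
below lists which single-nucleotide substitutions can convert a Gly codon
to an Asp codon, or a Ser codon to a Phe codon, under the standard genetic
code. The actual codons affected in VTC2 are not given in the text, so the
output shows the possibilities only; the function name is hypothetical and
codons are written on the coding strand.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def single_base_changes(aa_from, aa_to):
    """List (codon, mutant_codon) pairs that differ by one base and change
    the encoded amino acid from aa_from to aa_to (one-letter codes)."""
    changes = []
    for codon, aa in CODON_TABLE.items():
        if aa != aa_from:
            continue
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                if CODON_TABLE[mutant] == aa_to:
                    changes.append((codon, mutant))
    return changes


gly_to_asp = single_base_changes("G", "D")
ser_to_phe = single_base_changes("S", "F")
print("Gly to Asp:", gly_to_asp)
print("Ser to Phe:", ser_to_phe)
```

Every substitution the sketch reports is a G-to-A or C-to-T change at the
second codon position, the same class of change as the AG to AA splice-site
mutation.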

The F10M23.190 gene (VTC2) was previously annotated
as an undefined protein (GI:7452423; Mayer,
1999). The most similar proteins in the GenBank database
are as follows: Arabidopsis protein MC015.7,
Caenorhabditis elegans protein C10F3.4, and fruit fly
(Drosophila melanogaster) protein CG3552, none of
which have a demonstrated function. Thus, although 
we have a phenotype associated with mutations in 
VTC2, the regulatory or biosynthetic pathways lead- 
ing to the reduced vitamin C levels in these mutants 
remain to be discovered. 



DISCUSSION 

We have outlined a map-based cloning strategy, 
which leads to the identification of an Arabidopsis 
mutation in a straightforward manner in approximately
1 year. Our timeline assumes that it is possible
to determine the phenotype of F2 plants as they
are growing. If the phenotype of interest is measured 
on seeds (i.e. F3 seeds from F2 plants), then the mapping
time will be increased by 3 months. The strategy
that we propose is designed to minimize the number 
of plants that have to be subjected to phenotypic 
analysis. In most cases, DNA-based markers can be
determined faster and more accurately than individual
plant phenotypes. Obviously, if phenotyping is
easier than genotyping, this procedure can be
changed by identifying a large number of homozygous
mutant plants (or wild-type plants in the case of
dominant mutations) and genotyping these alone.

Modifications of the process that we have outlined 
can speed up the mapping timeline. In many cases, as
the mapping region is narrowed down, candidate 
genes become obvious, and it is possible to shift to 
sequencing at any stage during the process (Fig. 2).
For rare examples of very reliable phenotypes, it may 
not be necessary to grow an F3 generation for progeny
testing, thus shortening the timeline by 2
months. It is also possible to grow a single large F2
population, rather than two sequentially grown populations
(first-pass mapping and fine-scale mapping).
However, this may result in wasted effort
because some mutations are recalcitrant to genetic 
mapping. Situations that can make a given mutation 
difficult or impossible to map include: QTL variation 
for the trait of interest in the Col-0/Ler F2 population,
phenotypes caused by multiple mutations, sensitivity
of the phenotype to environmental variation in the
greenhouse or growth chamber, and non-nuclear 
mutations. 

The mapping timeline that we have outlined depends
on the ability to rapidly genotype large numbers
of plants. It may be difficult to maintain this
timeline by using gel-based methods for SNP and 
InDel detection. High-throughput SNP detection 
methods are available, but they involve a high initial 
equipment cost that could make them prohibitive to 
set up and use in an individual laboratory. One so- 
lution to this problem may be for universities or 
academic departments to set up genotyping centers, 
similar to those that currently exist for DNA sequencing.
Similar to a DNA sequencing center, a genotyping
center could serve a large number of researchers
working in all areas of molecular genetics. 

The current rate-limiting step for map-based cloning
in Arabidopsis is the number of F2 plants that
must be analyzed to find recombinants in a suffi- 
ciently small interval of DNA. There are no known 
methods for increasing meiotic recombination frequency
in Arabidopsis (or any other plant). However,
both ecotype-specific variation (Barth et al.,
2001) and mutations that decrease meiotic recombination
frequency (Masson and Paszkowski, 1997;
Grelon et al., 2001) have been reported. It is plausible
that it will be possible to selectively alter meiotic 
recombination frequency at some point in the not too 
distant future by crossing QTLs from other ecotypes 
into standard laboratory strains, by overexpressing 
proteins necessary for elevated meiotic recombina- 
tion, or perhaps by physical or chemical treatments 
that increase the recombination rate. 
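
To give a feel for why population size is rate limiting, the sketch below
estimates how many F2 plants are needed to have a given chance of
recovering at least one recombinant within an interval. It rests on
assumptions that are not part of this section: recombination is taken to be
proportional to physical distance, the 250 kb per cM value is an assumed
genome-wide average, and the function names are hypothetical. Local rates
can differ considerably from any average.

```python
import math


def prob_at_least_one_recombinant(n_f2_plants, interval_kb, kb_per_cM):
    """Chance that at least one of the 2N meiotic products in N F2 plants
    carries a crossover inside the interval."""
    r = interval_kb / kb_per_cM / 100.0   # recombination fraction per meiotic product
    return 1.0 - (1.0 - r) ** (2 * n_f2_plants)


def f2_plants_needed(interval_kb, kb_per_cM, confidence=0.95):
    """Smallest N for which the chance of at least one recombinant in the
    interval reaches the requested confidence."""
    r = interval_kb / kb_per_cM / 100.0
    if not 0.0 < r < 1.0:
        raise ValueError("interval and rate must give a recombination fraction in (0, 1)")
    products_needed = math.log(1.0 - confidence) / math.log(1.0 - r)
    return math.ceil(products_needed / 2)


# Assumed values: a 20-kb interval at an assumed average of 250 kb per cM.
n_needed = f2_plants_needed(20, 250)
print(n_needed, "F2 plants for a 95% chance of at least one recombinant")
```

Because a useful interval needs recombinants on both sides of the mutation,
the number of plants required in practice is larger than this single-event
estimate suggests.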

Sequencing of the Arabidopsis genome, the availability
of the Cereon Arabidopsis Polymorphism Collection,
and advances in the methods used for DNA
marker detection have made map-based cloning of
mutations in Arabidopsis a routine process. Mutation 
mapping will play a central role in the process of 
assigning a function to the thousands of plant genes 
that currently are known only as predicted open
reading frames. Given the advantages of map-based 
cloning that we have outlined in the introduction, 
this is a viable approach for gene discovery that can 
be used in any laboratory. 